Object Recognition Notes

These notes are based on the following sources:

Shimon Edelman and Nathan Intrator: Visual Processing of Object Structure, preliminary draft of
an article to appear in The Handbook of Brain Theory and Neural Networks (2nd ed.), M. A. Arbib,
ed., MIT Press, 2002.

Guy Wallis and Heinrich Bülthoff: Object recognition, neurophysiology, preliminary draft of an
article to appear in The Handbook of Brain Theory and Neural Networks (2nd ed.), M. A. Arbib,
ed., MIT Press, 2002.

Simon Thorpe and Michèle Fabre-Thorpe: Fast Visual Processing and its implications, preliminary
draft of an article to appear in The Handbook of Brain Theory and Neural Networks (2nd ed.),
M. A. Arbib, ed., MIT Press, 2002.


Introduction

Everyday experience tells us that our visual systems are very fast. In the 1970s, experiments
using Rapid Serial Visual Presentation (RSVP) techniques showed that humans are remarkably
good at following sequences of unrelated images presented at rates of up to 10 frames a second
(Intraub 1999, Potter 1999), an ability frequently exploited by producers of video clips. But the
fact that we can handle a new image every 100 ms or so does not necessarily mean that visual
processing can be completed in this time. As computer chip designers know, processing rates can
be improved by pipelining, in which several computational steps arranged one after the other
operate simultaneously on successive inputs. So, how can we determine the time it takes for the
visual system to process a scene? And how can we use this information to help constrain our
models of how the brain computes? These are some of the issues that we will address in this
chapter.

Interestingly, temporal constraints were one of the prime motivations for the development of
connectionist and PDP modeling in the early 1980s. Around this time, Jerry Feldman proposed
the so-called 100-step limit. He argued that since many high-level cognitive tasks can be
performed in about half a second, and since the interspike interval for cortical neurons is seldom
shorter than 5 ms, the underlying algorithms should involve no more than about 100 sequential,
though massively parallel, steps. Note, however, that the values used by Feldman were only
rough estimates unrelated to any particular processing task. In this chapter, we will review more
specific experimental data on processing speed before looking at how these temporal constraints
can be used to refine models of neural computation.
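
The arithmetic behind the 100-step limit can be written out as a minimal sketch. The function
name and its parameters are illustrative assumptions, not part of Feldman's formulation; the
two input values (about 500 ms per task, about 5 ms per interspike interval) are the rough
estimates quoted above.

```python
def max_sequential_steps(task_time_ms: float, min_interspike_ms: float) -> int:
    """Upper bound on strictly sequential steps, assuming each step costs
    at least one interspike interval (a simplifying assumption)."""
    if task_time_ms <= 0 or min_interspike_ms <= 0:
        raise ValueError("times must be positive")
    return int(task_time_ms // min_interspike_ms)


# Rough estimates quoted in the text: ~500 ms per task, ~5 ms per spike interval.
print(max_sequential_steps(task_time_ms=500.0, min_interspike_ms=5.0))  # 100
```

The bound scales linearly with the time available, so a task completed in 250 ms would leave
room for only about 50 such steps under the same assumption.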

Behavioral measures of processing speed

The ultimate test for processing speed lies in behavior. If animals can reliably make
appropriate behavioral responses to a given category of visual stimulus with a particular reaction
time, there can be no argument about whether the processing has been done. Thus, if a fly can
react to displacements of the visual world by a change in wing torque 30 ms later, it is clear that
30 ms is enough for both visual processing and motor execution. Fast behavioral reactions are not
limited to insects. For example, tracking eye movements are initiated within 70-80 ms in humans
and in around 50 ms in monkeys (Kawano 1999), and the vergence eye movements required to
keep objects within the fixation plane have latencies around 85 ms in humans and under 60 ms in
monkeys (Miles 1997). Such low values probably reflect the relatively simple visual processing
needed to detect stimulus movement and the short path lengths seen in the oculomotor system.
How fast could behavioral responses be in tasks that require more sophisticated visual
processing?

[Figure 1 image not reproduced. Panel A: reaction time distributions (ms) for targets and
distractors, with the minimum response time marked. Panel B: ERP traces (µV, mean of 15 subjects)
for animal and non-animal trials and their difference, plotted from 100 to 300 ms, with the ERP
difference onset marked.]

Figure 1: A. Reaction time distributions in a go/no-go scene categorization task. Statistically
significant differences between the responses to targets and distractors start at the minimum
response time of approximately 250 ms. B. Differential ERP responses to targets and non-targets
in the same task. The ERP difference starts at about 150 ms (Thorpe et al 1996).

In 1996, we reported results using a task that is a major challenge to the visual system
(Thorpe et al 1996). Subjects were presented with color photographs flashed for only 20 ms, and
asked to respond as quickly as possible if the image contained an animal. The images were
extremely varied, with targets that included mammals, birds, fish and insects in their natural
environments, and the distractors were also very diverse. Furthermore, no image was shown more
than once, forcing subjects to process each image from scratch with minimal contextual help.
Despite all these constraints, accuracy was high (around 94%), with mean reaction times (RTs)
typically around 400 ms.

While mean RT might be the obvious candidate for measuring processing speed, another
useful value is the minimal time needed to complete the task. Figure 1A plots separate RT
distributions for correct responses to targets and for incorrect responses to distractors in the
animal/non-animal task. Since targets and distractors were equally probable, the first time bin at
which correct responses start to significantly outnumber incorrect ones defines the minimal
response time. Responses at earlier times with no bias towards targets are presumably
anticipations triggered before stimulus categorization was completed. Remarkably, in the
animal/non-animal categorization task, these minimal response times can be under 250 ms.
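
The bin-by-bin comparison described above can be sketched as follows. This is only an
illustration under stated assumptions: the notes do not say which statistical test was used, so
a one-sided binomial (sign) test against equal probability is assumed here, with no correction
for testing several bins, and the response counts are hypothetical numbers invented for the
example.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def minimal_response_time(hits, false_alarms, first_bin_ms, bin_width_ms, alpha=0.05):
    """Start (ms) of the first bin in which correct responses to targets
    significantly outnumber incorrect responses to distractors, else None.
    The binomial test and the alpha level are assumptions of this sketch."""
    for i, (h, f) in enumerate(zip(hits, false_alarms)):
        n = h + f
        if n > 0 and h > f and binomial_tail(h, n) < alpha:
            return first_bin_ms + i * bin_width_ms
    return None


# Hypothetical counts per 10 ms bin, starting at 200 ms after stimulus onset.
hits = [1, 2, 2, 3, 4, 9, 15, 22]
false_alarms = [1, 1, 2, 2, 3, 1, 2, 1]
print(minimal_response_time(hits, false_alarms, first_bin_ms=200, bin_width_ms=10))  # 250
```

In the early bins the two counts are similar, which corresponds to the anticipatory responses
mentioned above; the estimate is the first bin in which the bias towards targets becomes
reliable.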

It might be thought that the images that trigger particularly short reaction times constitute a
sub-population of particularly easy images. However, we found no obvious features that
characterized rapidly categorized images (Fabre-Thorpe et al 2001). In other words, even with
highly varied and unpredictable images, the human visual system is capable of completing the
processing sequence that stretches from the activation of the retinal photoreceptors to moving the
hand in under 250 ms.

Humans can perform this challenging visual task quickly, but intriguingly, rhesus monkeys
are even faster. In monkeys, minimal RTs to previously unseen animal targets are as low as
170-180 ms (Fabre-Thorpe et al 1998). As in the tracking and vergence eye movement studies
mentioned earlier, it appears that humans take nearly 50% longer than their monkey cousins to
perform a given task.

Such data clearly impose an upper limit on the time needed for visual processing. However,
they do not directly reveal how long visual processing takes because the times obviously also
include response execution. How much time should we allow for the motor part of the task?
Although behavioral methods alone are unable to answer such questions, electrophysiological
data from single unit recording and ERP or MEG studies can be used to track information
processing between stimulus and response.

Single cell recordings and processing speed

Single unit activity is perhaps the easiest method to use since individual spikes are rather like
behavioral responses and the same technique of searching for the minimal latency at which
differential responses occur can be applied. If a neuron in monkey inferotemporal cortex
responds selectively (i.e. differentially) to faces at a latency of 80-100 ms post-stimulus, then it
follows that at least some forms of face processing can be completed by this time. By examining
the sorts of information that can be derived from differential spiking activity at different times
and in different visual structures, one can follow how processing develops over time.

Surprisingly, the use of response latency to track the time course of visual processing is a
relatively recent technique in experimental neuroscience. Nevertheless, by 1989 it was clear that
the onset latencies of selective visual responses in brain structures along the visual pathway
were a major constraint on models (Thorpe & Imbert 1989). Face-selective neurons had been
described in monkey inferotemporal cortex with typical onset latencies around 100 ms and,
beyond the visual system as such, it was known that neurons in the lateral hypothalamus could
respond selectively to food with a latency of 150 ms. Although these earlier studies suggested
that visual processing could be very fast, they did not specifically determine at which point the
neuronal response was fully selective. This issue was dealt with in 1992 when it was shown that
even the first 5 ms of the response of neurons in monkey inferotemporal cortex could be highly
selective to faces (Oram & Perrett 1992). Thus, by determining the earliest point at which a
particular form of stimulus specificity can be seen in the visual pathway, it is possible to
assign firm limits to the processing time required to reach a certain level of analysis.

Before leaving our discussion of single unit responses, we should mention another approach
to measuring processing speed, directly inspired by the behavioral RSVP studies mentioned
earlier. Keysers et al recently looked at how face-selective neurons in the monkey temporal lobe
respond to sequences of images presented at high rates. By varying the effective frame rate at
which the images were presented, they found that although the strength of the response
decreased when frame rate was increased, the neurons were still being clearly specifically driven
by the stimulus when the image was changed every 14 ms, i.e. at a frame rate of 72 Hz
(Keysers et al 2001). This very impressive ability to follow rapidly changing inputs is one of the
hallmarks of feed-forward pipeline processing, a point we will return to later.

ERP or MEG data and processing speed

Event-Related Potentials and Magnetoencephalography can also be very informative, although
it is less easy to be sure about the precise start of the neuronal response than with single unit
data. Furthermore, signals recorded from a particular site on the scalp can be influenced by
activity from a very large number of neurons, making it difficult to localize their source with
precision. However, by looking for the earliest times at which the response varies in a systematic
way according to a given characteristic of the input, we can determine the minimal time it takes
to process it.

For example, in subjects performing the animal/non-animal categorization task described
earlier, simultaneously recorded ERPs showed that if one averages together the responses for all
correct target trials and compares the trace with the average response to all correctly
categorized distractors, the two curves coincide almost perfectly until about 150 ms
post-stimulus, at which point they diverge dramatically (see Figure 1B; Thorpe et al 1996). This
differential ERP response, which appears to be specifically related to target detection, is
remarkably robust and occurs well in advance of even the fastest behavioral responses. A value
of 150 ms for this initial rapid form of visual processing leaves no more than 100 ms for motor
execution when behavioral reactions occur at around 250 ms.
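
One simple way to locate the point at which two averaged traces diverge is sketched below. It is
not the analysis used in the cited study: the amplitude threshold, the requirement of several
consecutive samples, and the trace values are all assumptions chosen for illustration.

```python
def divergence_onset_ms(target_uv, distractor_uv, sample_ms, threshold_uv, min_run=3):
    """Time (ms) of the first sample that starts a run of `min_run` consecutive
    samples in which the two traces differ by more than `threshold_uv`.
    Returns None if no such run exists. Sample 0 is taken as stimulus onset."""
    run = 0
    for i, (t, d) in enumerate(zip(target_uv, distractor_uv)):
        if abs(t - d) > threshold_uv:
            run += 1
            if run == min_run:
                return (i - min_run + 1) * sample_ms
        else:
            run = 0  # an isolated excursion does not count as divergence
    return None


# Hypothetical averaged traces (µV), one sample every 10 ms from stimulus onset.
distractor = [0.0] * 20
target = [0.0, 0.1, -0.1, 0.2, 0.0, 0.1, 0.6, 0.1, 0.0, -0.1,
          0.1, 0.0, 0.2, 0.1, 0.0, 0.8, 1.5, 2.2, 2.9, 3.5]

onset = divergence_onset_ms(target, distractor, sample_ms=10, threshold_uv=0.5)
print(onset)        # 150
print(250 - onset)  # 100 ms left for motor execution if responses start at 250 ms
```

The single excursion at 60 ms in the example is ignored because it does not persist, which is
the reason for requiring a run of samples rather than a single threshold crossing.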

Some more recent studies have reported differential category-specific activation at even earlier
latencies. For example, differential activity specific to gender has been reported to start as early
as 45-85 ms post-stimulus (Mouchetant-Rostaing et al 2000). However, it might be that such early
differential activity should be interpreted more in terms of low-level statistical differences
between different categories of stimuli, rather than marking the decision that a particular
category is present. This point is made clear in a study that used two different categorization
tasks with the same sets of images. The images were either animals, means of transport, or other
varied distractor images, but the target category varied from block to block. By averaging ERP
activity appropriately, it was possible to demonstrate that early ERP differences (between 75 and
120 ms) could be explained by statistically significant differences between processing for the two
types of image. In contrast, the differential activity starting at around 150 ms was clearly related
to the processing of the image as a target and not to its physical characteristics (VanRullen &
Thorpe 2001).

This rapid, and very incomplete, review has hopefully shown how behavioral and
electrophysiological data can be used to define temporal constraints that can be applied to
particular sensory processing tasks. In the remainder of this chapter we will discuss how such
data can be used to constrain the underlying processing algorithms.

Implications for computational models

The ability to determine the minimal time required to perform a particular task or
computation is not, by itself, enough to constrain models of the underlying mechanisms. For
example, we know that neurons in primate inferotemporal cortex can respond selectively to faces
with latencies of 80-100 ms. But the computational implications of this fact only become clear
when one takes into account the number of processing steps involved and some details of the
underlying physiology. As pointed out in the late 1980s (Thorpe & Imbert 1989), information
reaching Anterior Inferior Temporal cortex (AIT) in 80-100 ms presumably has to go through the
retina and the lateral geniculate as well as cortical areas V1, V2, V4 and the Posterior Inferior
Temporal cortex (PIT). While only one synaptic relay is required to pass through the geniculate, it
is unlikely that afferents reaching cortical areas will make significant direct connections onto
output neurons, meaning that at least two synaptic relays are involved at each cortical stage. This
means that the minimal path length from retina to AIT probably involves at least 10 successive
steps, implying that at each stage processing must be done within about 10 ms. Given that firing
rates of cortical neurons only rarely exceed 100 spikes per second, very few neurons will have
time to fire more than one spike in this 10 ms processing window. Such constraints severely limit
the possibilities for using iterative processing loops, and also call into question the feasibility
of using conventional firing-rate based coding strategies.
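
The time budget in this argument is simple arithmetic, and a short sketch makes the numbers
explicit. The function names are illustrative assumptions; the inputs (about 100 ms to reach AIT,
at least 10 successive steps, firing rates of at most about 100 spikes per second) are the
figures given above.

```python
def per_stage_budget_ms(total_latency_ms: float, n_steps: int) -> float:
    """Average time available per processing step along the pathway."""
    return total_latency_ms / n_steps


def expected_spikes(rate_hz: float, window_ms: float) -> float:
    """Mean number of spikes a neuron firing at `rate_hz` emits in `window_ms`."""
    return rate_hz * window_ms / 1000.0


budget = per_stage_budget_ms(total_latency_ms=100.0, n_steps=10)
print(budget)                          # 10.0 ms per step
print(expected_spikes(100.0, budget))  # 1.0 spike in that window
```

With roughly one spike per neuron per step, a downstream neuron cannot estimate a firing rate
from a single input within the window, which is the sense in which the constraint bears on
rate-based coding.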

While one can always raise doubts concerning the functional significance of face-selective
neuronal responses at 100 ms, there can be little ambiguity when one takes into account the fact
that monkeys can produce reliable manual responses from as early as 170 ms post-stimulus onset
in a challenging high-level visual categorization task. If we suppose that high-order visual areas
such as AIT are indeed involved in such tasks, we need to propose a route by which activation
could pass from the temporal lobe to the hand. This is not a trivial problem, since the temporal
lobe does not have direct connections to motor outputs. Figure 2 shows one possible route, via
prefrontal and premotor cortex. It also stresses the point that at least in the case of the earliest
behavioral responses, the time available for processing at each stage is so limited that there is
very little time for anything other than a feed-forward pass.


Figure 2: A possible input-output pathway for performing go/no-go visual categorization tasks in
the monkey. Information passes from the retina to the lateral geniculate nucleus (LGN) before
arriving in cortical area V1. Further processing occurs in areas V2 and V4 and the posterior and
anterior inferotemporal cortex (PIT and AIT) before being relayed to the prefrontal cortex (PFC),
premotor (PMC) and motor cortices (MC). Finally, activation of motoneurons in the spinal cord
triggers movement of the hand. For each area, the two numbers provide approximate values for
(i) the latency of the earliest responses, and (ii) a more typical average response latency.

Some of the other data mentioned earlier also point towards the notion of rapid visual
processing being largely feed-forward. We noted that following three weeks of training, there
was no evidence that very familiar images could be processed faster than previously unseen
ones. Such a "floor effect" for visual processing speed is one of the hallmarks of feed-forward
processing.

Other arguments in favor of feed-forward processing come from studies on the ability of
neurons in IT to follow very rapidly changing inputs. While it would not be that surprising to
learn that neurons close to the sensory input can follow rapid input changes, the fact that neurons
so far into the brain can still modulate their responses to inputs that last only 14 ms is strong
evidence that their selectivity does not depend on lengthy recurrent processing at previous
stages. Finally, all the studies that show the full selectivity of the very initial part of the neuronal
response should be considered as arguing in favor of feed-forward mechanisms.

Distinguishing feed-forward and recurrent processing

We have seen that behavioral and electrophysiological data provide strong evidence in favor
of the idea that at least some forms of visual processing can be achieved on the basis of a
feed-forward pass. Indeed, the behavioral data on Ultra-Rapid Visual Categorization imply that
even tasks involving high-level superordinate categorization of complex natural scenes can be
realized in this way. Nevertheless, it needs to be stressed that this should not be taken to imply
that vision is "done" in 150 ms. Many important aspects of vision must rely on extensive recurrent
processing. So how can temporal analysis be used to determine which aspects of vision can be
done using a feed-forward pass, and which require more time-consuming iterative mechanisms?

Let us return for a moment to the question of the dynamics of single cell selectivity, raised
earlier. We argued that many forms of selectivity are present right from the very beginning of the
neuronal response (for example, orientation selectivity, and selectivity to faces). However, this
does not mean that neuronal selectivity is fixed. For example, the orientation tuning of neurons in
V1 can fluctuate considerably during the course of the response (Ringach et al 1997), a result
consistent with the idea that neuronal properties are dynamic and constantly under the influence
of recurrent connections. However, even more interesting are reports that certain neuronal
properties appear only later during the time course of the response. For example, a number of
recent studies have looked at the time course of processing related to perceptual processes such
as texture segmentation and filling-in at the level of primary visual cortex and have reported that
such phenomena can often take several tens of milliseconds to develop (Lamme et al 1998).
Another example is inferotemporal cortex, where it has been reported that while certain
attributes of the face were encoded at the very start of the response (around 100 ms
post-stimulus), other attributes such as face identity or expressions were only encoded from
about 150 ms post-stimulus (Sugase et al 1999).

Detailed temporal analysis of the time course of neuronal selectivity can thus allow processing
delays to be attached to the processing of particular visual characteristics. It looks likely that
some aspects of the visual input can be derived from the very beginning of the neural response,
suggesting that they depend mainly on feed-forward processing, whereas other aspects of the
input take time to analyze and presumably involve recurrent computation.


Final Comments

Artificial Neural Networks are typically divided into two main types: feed-forward networks
(which include Multi-layer Perceptrons and most Back-propagation networks), and recurrent
networks in which processing loops can occur (see Figure 1). In biological systems, it is relatively
rare to find sensory pathways that are anatomically purely feed-forward. For example, in the
primate visual system, only the retina is not affected by feedback connections coming from the
thalamus. All the other levels (LGN, V1, V2, V4 etc) have extensive feedback connections and it is
widely believed that all visual processing is a complex interaction of bottom-up propagation and
top-down interactive feedback. However, even in a sensory pathway in which top-down
connections greatly outnumber bottom-up ones, a distinction between feed-forward and
recurrent processing can still be made if one examines the temporal dynamics of the response. In
this chapter we have discussed both behavioral and electrophysiological data that indicate that
at least some forms of sophisticated visual processing can be achieved on the basis of the initial
feed-forward pass. But of course this should not be considered as being the end of visual
processing. Indeed, the results of this rapid first pass can be used to improve the efficiency of
demanding processes such as image segmentation, an extremely difficult task using purely
image-based techniques. Suppose, for example, that the initial feed-forward pass was able to
locate the presence of an eye at a particular point in the retinal image. This information could be
sufficient to trigger a behavioral response in a task for which an eye is a diagnostic feature. But
the same information could be extremely useful in guiding segmentation processes occurring at
lower levels, since knowing that an eye is present implies that a face is probably also present.

Note how the present formulation contrasts with a more classic hierarchical processing model
in which visual processing is divided into several steps and where each step needs to be
completed before the next step can be initiated. For example, in many current image-processing
models scene segmentation has to occur before the mechanisms of object recognition can start.
But the alternative proposed here is that despite the multi-layered nature of the visual system, a
very fast feed-forward pass can act as a seeding process that can allow subsequent processing to
be performed in an intelligent top-down way.

Finally, it should be stressed that the existence of a very fast feed-forward mode of processing
has major consequences for our understanding of brain computation. Although space does not
allow us to discuss this point in detail, the fact that computation can be done under conditions
where most neurons only get to fire one spike causes serious problems for the conventional view
that most if not all information is transmitted in the form of a rate code. In particular, it forces us
to look for alternative coding schemes that are compatible with a single-spike processing mode.



Figure WB1: Principal divisions of neocortex, including the main areas of the temporal lobe. Light
arrows indicate information flow up along the dorsal stream. Dark arrows indicate flow along the
ventral stream.

The functional divisions of neocortex

From the occipital lobe, information flows down into the temporal lobe, forming the lower
(ventral) stream; and up into the parietal lobe, forming the upper (dorsal) stream (young92b;
see Figure WB1).

The classical symptoms of patients with parietal lobe lesions are a good ability to recognize
and name objects, but a poor ability to integrate them into a scene. Patients often suffer from
neglect in specific areas of the visual field, being unaware of certain elements of the scene before
them. In extreme cases an entire hemifield can be largely ignored, leading to curious phenomena,
such as a failure to eat food from one half of a plate. Visuo-spatial neglect and scene
understanding problems are attributed to an inability to motivate shifts in attention and direction
of gaze throughout a scene, leading to speculation that the parietal lobe is involved in guiding
attention and eye movements (farah90).

Damage to the temporal lobe, on the other hand, is associated with specific types of
recognition agnosias, including problems in naming and categorizing objects such as people's
faces (farah90).

The dorsal stream was likened to the task of deciding "where" an object is, and the ventral
stream "what" an object is. This distinction has been borne out by many more recent studies which
have looked at the selectivity of individual neurons in cortex (ungerleider94). The "what" stream
is seen as the center of object recognition, but an integrated model of scene perception will almost
certainly require a wider-reaching approach.

If we look at an airplane it may appear on one level to be a single entity or object, but if our
task is to analyze details of the plane it becomes a scene containing a fuselage, wings, a
tail-plane, engines and wheels. It may, therefore, be inappropriate to regard scene and object
perception as being subserved by totally separate mechanisms of analysis, and an understanding
of how we represent objects may in some part guide models of scene representation. At an
abstract level, a scene can be analyzed globally and extremely rapidly, providing observers with
what has been referred to in the literature as the "gist" of what they are looking at (rensink00b).
The object recognition system may well perform analysis of a scene at this level. Indeed, the
rapid processing of scenes at a global/abstract level has been shown to influence the speed and
accuracy with which objects within the scene are recognized. But how, then, are scenes
represented at the level of segregated objects, including relative location information? Recording
and modeling results all point to this being achieved through the response properties of parietal
lobe neurons, and hence a full understanding of how scenes are represented must include
information stored in the parietal lobe. Indeed, since gist can affect the basic processes going on
in scene analysis such as target selection, information about the gist of the scene must be relayed
to the parietal lobe from the temporal lobe.

There are plenty of routes which gist information could take between the temporal and
parietal lobes, including directly, via the occipital lobe, or via the frontal lobe. Later stages of IT
(AIT/CIT) connect to the frontal lobe, whereas earlier ones (CIT/PIT) connect to the parietal lobe.
This functional distinction may well be important in forming a complete picture of inter-lobe
interaction. Whilst many functions have been ascribed to the parietal and temporal lobes,
relatively little is known about the function of the frontal lobe. What we do know is that it acts
[MAA: in part] as a temporary or working memory store (desimone95). It may well turn out that
the frontal lobe acts as a running store of objects currently being represented within a scene
(rensink00b, logothetis98), providing the final link in the chain.

The ventral stream

The path from primary visual cortex to the temporal lobe is a long one, passing through as
many as ten neural areas before reaching the last wholly visual areas just beyond CIT. Neurons'
receptive fields grow larger the further down the processing chain one looks. Neurons in the
occipital lobe might have receptive field sizes up to 4° in V4, but the receptive fields of neurons
in CIT can be as large as 150°. Recordings from the temporal lobe showed neurons selective for
hands and faces. Cells selective to faces could not be excited by simple visual stimuli or complex
non-faces, nor indeed by every face tested. Many neurons in the temporal lobe are tolerant to
shifts in stimulus position, changes in viewing angle, image contrast, size/depth, illumination or
the spatial frequencies present in the image (desimone91, logothetis96).

[MAA: Some people (e.g., Biederman) would argue that the recognition of faces (or other
familiar object families, such as motor cars, for which we recognize parametric variation)
involves very different mechanisms from recognition of objects for which 3D relations are crucial
(as in Biederman's geons, but other models are possible). You need to be more analytic about the
general challenge of object recognition, and the way in which face recognition is paradigmatic
only for some aspects of object recognition considered more generally.]

Work on how the cellular response properties of temporal lobe neurons change over time has
mainly concerned slow, long-term learning effects in which continuous exposure to an object
class resulted in changes in the number of neurons selective for that stimulus (kobatake98,
miyashita93b). However, some studies have focused on almost instantaneous changes which
reflect the speed of behavioral changes measured in human responses. In one study, two-tone
images of strongly lit faces were shown to monkeys. Some temporal lobe neurons which did not
respond to any of the two-tone faces did so once the monkey had been exposed to the standard
picture of the face. This accords with findings in humans, who often struggle to interpret
two-tone images for the first time, but after seeing the standard picture have no difficulty
interpreting the two-tone image, even weeks later.

[MAA: How might this be modeled? I think it's a strong argument for a top-down influence
that goes beyond the sort of feedforward model you insist on later.]

A processing hierarchy



Figure WB2: Schematic of convergence in the ventral processing stream. The steady growth in
receptive field size suggests that neurons in one layer of the hierarchy receive input from a select
group of neurons in the preceding layer. The time taken for the effects of seeing a new visual
stimulus to appear increases systematically through the hierarchy, supporting the notion of a
strictly layer-by-layer structure.

Neurons in the later regions of the temporal lobe can be thought of as sitting on the top of a
processing pyramid (see Figure WB2). Receptive field size grows steadily larger the further up
this pyramid one looks, and this can be seen as a direct consequence of the convergent,
hierarchical nature of the processing stream. Moreover, delays grow in a consistent manner the
further down the processing chain one looks.
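
A toy calculation shows how convergence alone produces both trends. It is a sketch under strong
assumptions that are not taken from the sources: a one-dimensional layout, each unit pooling a
fixed number of adjacent units from the layer below, and a fixed delay per layer. The numbers
are hypothetical and are not receptive field sizes or latencies measured in cortex.

```python
def receptive_field_sizes(n_stages: int, fan_in: int, input_rf: int = 1):
    """Receptive field width (in input units) at each level of a 1-D hierarchy
    in which every unit pools `fan_in` adjacent units of the layer below."""
    sizes = [input_rf]
    for _ in range(n_stages):
        # Adjacent inputs are offset by one input unit, so pooling `fan_in`
        # of them widens the field by (fan_in - 1).
        sizes.append(sizes[-1] + (fan_in - 1))
    return sizes


def onset_latencies(n_stages: int, first_ms: float, per_stage_ms: float):
    """Response onset at each level, assuming a fixed delay per layer."""
    return [first_ms + i * per_stage_ms for i in range(n_stages + 1)]


print(receptive_field_sizes(n_stages=4, fan_in=3))                # [1, 3, 5, 7, 9]
print(onset_latencies(n_stages=4, first_ms=40.0, per_stage_ms=10.0))
# [40.0, 50.0, 60.0, 70.0, 80.0]
```

The linear growth here follows from the simplifying choice of overlapping neighbors; any scheme
in which higher layers also subsample their inputs would make receptive fields grow faster.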

However, there are as many connections running back as there are forward in the ventral
stream, and this is important when one comes to devise models. Some theorists have argued that
these backward connections are used in recall, and it is true that the act of remembering visual
events causes activity to spread into primary visual areas. Alternatively they may control visual
attention. The very act of attending to specific regions of our visual environment has been shown
to facilitate the processing of signals in that region, which may well be due to selectively raising
the activity of neurons along the processing hierarchy which correspond to that visual region.
Still other theorists have proposed an integral role for backward connections in visual
processing. Whilst such models may be required to deal with confusing or low-quality images,
there is good evidence that timing constraints prohibit such a model from acting during normal
recognition, due to the sheer number of neural delays which have to be traversed before
information can travel from the eye to the temporal lobe. For that reason, several theorists have
proposed a feed-forward architecture (wallis97a, wallis99c). One of the possible roles of the
processing hierarchy is to construct ever more complex combinations of features, as a natural
continuation of the simple and complex cells of primary visual cortex. In a simplistic sense one
can imagine a hierarchy in which edges are combined into junctions, junctions into closed
contours, contours into volumes, volumes into shapes, and shapes into objects.

[MAA: Thorpe and Fabre-Thorpe have an article on Fast Visual Processing. I think feedforward
models are fine for recognizing variations of iconic scenes ("Oh, there's the Eiffel Tower") but not
for more complex scenes which combine contextual cues with the unexpected.]

Wallis argues that the visual system extracts invariances from the environment so as to build
representations which facilitate data abstraction and extrapolation, and which also facilitate the
building of compact, detailed memories.

[MAA: (a) Recognition of 3D spatial relationships must surely be crucial to object recognition
generally, even if not for face recognition. (b) How could there not be data abstraction? But what
are useful hypotheses on what is being abstracted? What are the best models of the process? How
well do they relate to neurophysiological data (e.g., that of Tanaka)? And what forms of
extrapolation do they support, etc.?]

The highest level of abstraction appears to take place in the superior temporal areas where, it
has been suggested, view-invariant cells in STPa pool the outputs of view-selective AIT cells.
The only problem with such an approach is how disparate views can be associated by a neuron
seeking to build an invariant representation of an object feature.

Encoding objects in the temporal lobe

Hypothesis: Object encoding is achieved via a distributed scheme in which many hundreds or
thousands of neurons, each selective for its specific feature, would act together to represent an
object. Although many of these features represent only small regions of an object, others appear
to represent an object's outline, or some other global but general property. In addition, the neural
representation of these features may exhibit invariance to scale and size, something typical of
temporal lobe neurons (hinton86b).

[MAA: How could one explain the recognition of a human figure in a complex scene, despite
huge variations in size, pose, clothing, activity, etc.?]

Network models of object recognition

The "feature binding problem": imagine a cell trained to respond to the appearance of a set of
features in, say, a translation-invariant way. What is to stop this neuron from responding to novel
rearrangements of these features? For example, a simple feature-based recognition scheme would
treat a jumble of two eyes, a mouth and a nose as the same as a normal face. Real neurons
responsive to faces are generally not impressed by such stimuli. Successfully determining the
features present in an invariant manner, whilst still retaining spatial configuration, leads to an
instance of the binding problem. Models which throw away spatial information so as to achieve
translation invariance will run into the problem of rearrangements of the features triggering
recognition. Our current opinion on this is that binding does not become an issue if the
features supporting recognition are built up gradually in spatially localized regions, as they are in
the successive stages of the ventral stream. By responding to local combinations of neurons
co-active in the previous layer, a single neuron in the next layer should not learn to respond to
some arbitrary spatial arrangement of the same features (wallis96d). [MAA: cf. Neocognitron in 1e.]
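
The contrast between a scheme that discards spatial information and one that keeps local
combinations can be made concrete with a toy example. The sketch below is not the model of
wallis96d; the grid encoding, the feature letters (E, N, M for eye, nose, mouth) and the choice
of neighbour directions are all assumptions made for illustration, and only a single stage of
local combination is shown rather than a full hierarchy.

```python
# Neighbour directions over which local feature pairs are collected (assumed).
OFFSETS = {"right": (0, 1), "down": (1, 0), "down-right": (1, 1), "down-left": (1, -1)}

FACE = ["E.E",
        ".N.",
        ".M."]

SHIFTED_FACE = ["....",
                ".E.E",
                "..N.",
                "..M."]

SCRAMBLED = ["M.N",
             ".E.",
             ".E."]


def bag_of_features(grid):
    """Which features are present, with all position information discarded."""
    return sorted(cell for row in grid for cell in row if cell != ".")


def local_conjunctions(grid):
    """Set of (feature, direction, neighbouring feature) triples.

    Absolute position is discarded, so the code is translation invariant,
    but the relative position of adjacent features is retained.
    """
    found = set()
    for r, row in enumerate(grid):
        for c, feature in enumerate(row):
            if feature == ".":
                continue
            for name, (dr, dc) in OFFSETS.items():
                rr, cc = r + dr, c + dc
                if 0 <= rr < len(grid) and 0 <= cc < len(grid[rr]) and grid[rr][cc] != ".":
                    found.add((feature, name, grid[rr][cc]))
    return found


if __name__ == "__main__":
    print(bag_of_features(FACE) == bag_of_features(SCRAMBLED))          # same parts
    print(local_conjunctions(FACE) == local_conjunctions(SHIFTED_FACE))  # shift tolerated
    print(local_conjunctions(FACE) == local_conjunctions(SCRAMBLED))     # jumble rejected
```

The bag of features cannot tell the face from the jumble, whereas the local conjunctions
tolerate the shift and still reject the jumble.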

Temporal order as a key to recognition learning

The question still remains as to how neurons learn to treat their preferred feature as the same,
irrespective of its size or location. Indeed, ultimately, one would like to understand how neurons
learn to recognize objects as they undergo non-trivial transformations due to changes in, say,
viewing direction or lighting. Hypothesis: under normal viewing conditions, by approaching
an object, watching it move, or rotating it in our hand, we receive a consistent associative
signal capable of bringing all of the views of the object together. The discovery that cells in the
temporal lobe become selective to stimuli on the basis of the order in which they appear, rather
than spatial similarity, provides important preliminary support for this idea (miyashita93b,
wallis99a).
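
One way to formalize learning from temporal order is a Hebbian rule in which the postsynaptic
term is a decaying trace of recent activity, so that a view is associated with whatever the
neuron was responding to a moment earlier. The sketch below assumes that formulation; the
parameter names (eta, alpha), the one-hot "views" and the normalization step are illustrative
choices, and the notes do not say that the cited models use exactly this rule.

```python
from math import sqrt

# Hypothetical stimuli: four views with no spatial overlap (one-hot vectors).
OBJECT_A = [[1, 0, 0, 0], [0, 1, 0, 0]]  # two successive views of object A
OBJECT_B = [[0, 0, 1, 0], [0, 0, 0, 1]]  # two successive views of object B


def train_with_trace(weights, sequences, eta=0.8, alpha=0.5):
    """Hebbian learning with a temporal trace of the neuron's output.

    eta   -- how much of the previous trace is carried over (0 = plain Hebb)
    alpha -- learning rate
    """
    weights = list(weights)
    for sequence in sequences:
        trace = 0.0  # the trace is reset between objects
        for view in sequence:
            y = sum(w * x for w, x in zip(weights, view))
            trace = (1.0 - eta) * y + eta * trace
            weights = [w + alpha * trace * x for w, x in zip(weights, view)]
        norm = sqrt(sum(w * w for w in weights)) or 1.0
        weights = [w / norm for w in weights]  # keep the weights bounded
    return weights


if __name__ == "__main__":
    start = [1.0, 0.0, 0.0, 0.0]  # the neuron initially prefers view 1 of A
    print(train_with_trace(start, [OBJECT_A, OBJECT_B]))           # with trace
    print(train_with_trace(start, [OBJECT_A, OBJECT_B], eta=0.0))  # plain Hebb
```

With the trace, the neuron acquires a weight for the second view of object A even though that
view shares no input with the first; without it (eta set to 0), nothing is learned about the
second view. Views of object B, never seen in the same sequence, stay unassociated either way.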



A functional characterization of structure processing

Any object recognition task, such as identification or categorization, has at its core a common
operation, namely, the matching of the stimulus against a stored memory trace (Ullman, 1989).

[MAA: Not necessarily. If one is, e.g., counting the number of legs on a creature as part of
recognizing it, then one might be analyzing the stimulus using a stored memory trace, but not
matching it.]

For the structure-processing tasks, a key characteristic is the restriction of the spatial scope of
at least some of the operations involved to some fraction of the visual extent of the object or
scene under consideration. Here are a few examples of structural tasks:



Given two objects, or an object and a class prototype, identify their corresponding regions.
The correspondence here may be based on local shape similarity (find the eyes of a face in
a Cubist painting), or on a similar role played by the regions in the global structure (find the
eyes in a smiley icon).




Given an object and an action, identify a region in the object towards which the action can
be directed. Similarities between objects vis-à-vis this task are defined functionally (as in
the parallel that can be drawn between the handle of a pan and a door handle: both afford
grasping). [MAA: (a) What are the best references on this? (b) One issue you seem to
neglect is "recognition for what": e.g., when do I recognize a hand versus a monkey hand
versus a monkey hand of a particular size grasping in a particular way, and how are these
"levels of recognition" related to one another?]




Given an object, describe its structure. This explicitly structural task arises in the context
of trying to make sense of an unfamiliar object (as in perceiving a hot-air balloon, upon
seeing it for the first time, as a pear-like shape over a box-like one).


Appearance-based computational approaches to recognition and categorization hold that objects
are represented by collections of entire, spatially unanalyzed views (Murase and Nayar, 1995;
Duvdevani-Bar and Edelman, 1999).

"Classical" mereological structural decomposition approaches (Biederman, 1987; Bienenstock
et al., 1997) have the opposite tendency: the recursive symbolic structure they impose on objects
seems too rigid and too elaborate.

Object form processing in computer vision

The specific notion that structural descriptions are to be expressed in terms of volumetric parts
(Binford, 1971) was subsequently adopted by Biederman (1987), who developed it into a
(psychological) theory of Recognition By Components (RBC): the generic parts, called geons
(generalized cylinders), are bound together by categorical relations (Bienenstock et al., 1997). By
virtue of their compositionality, the classical structural descriptions meet the two main challenges
in the processing of structure: the visual system is (i) productive, in that it can deal effectively
with a potentially infinite set of objects; and (ii) systematic, in that a well-defined change in the
spatial configuration of the object (e.g., swapping top and bottom parts) causes a principled
change in the representation (Hadley, 1997; Hummel, 2000b).

However, the requirement that object parts be "crisp" and relations syntactically compositional
is difficult to adhere to in practice. The two implemented systems designed to derive structural
descriptions from raw images (Dickinson et al., 1992; Du and Munck-Fairwood, 1995) had
difficulty with extracting sufficiently good line drawings, and with the idealized nature of the
geon representation. Both these problems can be effectively neutralized by giving up the classical
compositional representation of shape by a fixed alphabet of crisp primitives (geons) in favor of a
superpositional coarse-coding by an open-ended set of image fragments. The system described
by Nelson and Selinger (1998) starts by detecting contour segments, then determines whether
their relative arrangement approximates that of a model object. Because none of the individual
segment shapes or locations is critical to the successful description of the entire shape, this
method does not suffer from the brittleness associated with the classical structural description
models of recognition. Moreover, the tolerance to moderate variation in the segment shape and
location data allows it to categorize novel members of familiar object classes.

[MAA: How does one go recursive? I.e., when do loose groupings of features define salient
subparts to then be loosely related in defining larger objects? Conversely, what about top-down
approaches where a crude low-spatial-frequency analysis of a scene can drive generic pattern
recognition within which one can focus attention to find features which direct finer object
characterization?]


Burl et al. (1998) combine "local photometry" (shape primitives that are approximate
templates for small snippets of images) with "global geometry" (the probabilistic quantification of
spatial relations between pairs or triplets of primitives).

Camps et al. (1998) represent objects in terms of appearance-based parts (defined as
projections of image fragments onto principal components of stacks of such fragments) and their
approximate relations.

Sali and Ullman (1999) use snippets of images taken from objects to be recognized to represent
these objects; recognition is declared if a sufficient proportion of the fragments are detected, and
if the spatial relations among these conform to the stored description of the target. In all these
methods, the interplay of loosely defined local shape ("what") and approximate location
("where") information leads to robust algorithms supporting both recognition and categorization.
We contend that these same methods also provide an effective alternative to the classical
structural description approach to the processing of object form.
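
The decision rule shared by these fragment-based methods (enough fragments found, in roughly
the right spatial arrangement) can be sketched as follows. This is a toy illustration and not
the algorithm of any of the cited papers: the fragment names, the stored layout, the use of a
mean displacement to estimate translation, and the thresholds are all assumptions.

```python
from math import hypot
from statistics import mean

# Hypothetical stored description: fragment name -> position within the object.
FACE_MODEL = {"left_eye": (0, 0), "right_eye": (4, 0), "nose": (2, 2), "mouth": (2, 4)}


def recognize(model, detections, min_fraction=0.6, tolerance=1.5):
    """Declare recognition if enough fragments are found in a consistent layout.

    model      -- {fragment: (x, y)} stored layout
    detections -- {fragment: (x, y)} where each fragment was found in the image
    """
    found = [name for name in model if name in detections]
    if len(found) < 2:
        return False
    # Estimate the object's translation as the mean displacement of the fragments.
    dx = mean(detections[n][0] - model[n][0] for n in found)
    dy = mean(detections[n][1] - model[n][1] for n in found)
    # A fragment is consistent if it lies close to where the model predicts it.
    consistent = [
        n for n in found
        if hypot(detections[n][0] - model[n][0] - dx,
                 detections[n][1] - model[n][1] - dy) <= tolerance
    ]
    return len(consistent) / len(model) >= min_fraction


if __name__ == "__main__":
    # Shifted and slightly distorted face, with the mouth fragment missed.
    noisy = {"left_eye": (10, 5), "right_eye": (14.5, 5), "nose": (12, 7.5)}
    # All four fragments detected, but in a jumbled arrangement.
    jumbled = {"left_eye": (12, 9), "right_eye": (10, 5), "nose": (14, 5), "mouth": (12, 7)}
    print(recognize(FACE_MODEL, noisy))
    print(recognize(FACE_MODEL, jumbled))
```

Because no single fragment is required and positions are only checked up to a tolerance, the
rule accepts the incomplete, distorted face but rejects the jumble, which is the robustness
argued for in the paragraph above.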


Mechanisms implicated in structure processing in primate vision

Although no evidence seems to exist for the neural embodiment of geons as such, cells in the
inferotemporal (IT) cortex were reported to exhibit a higher sensitivity to "non-accidental" visual
features such as those that define geons than to "metric" properties of the stimuli (Vogels et al.,
2000). Also, an interpretation along the classical structural lines has been offered (Biederman,
2000) for the role of the cells in the IT cortex that are tuned to very specific
shapes, as determined by the stimulus reduction technique (Tanaka et al., 1991). It has been
argued that symbols representing the parts are bound into the proper structure dynamically, by
the synchronous firing of the neurons that code each symbol (von der Malsburg, 1999). Thus, a
mechanism capable of supporting dynamic binding must be available; it is possible that this
function is fulfilled by the synchronous or phase-locked firing of cortical neurons (Gray et al.,
1989; Singer and Gray, 1995), although the status of this phenomenon in primates has been
disputed (Young et al., 1992; Kirschfeld, 1995).

An alternative theory proposed by Edelman and Intrator (2000b) calls for an open-ended set
of fragments instead of geons, and posits binding by retinotopy in addition to activity
synchronization.

[MAA: Is activity synchronization necessary for the model, or can it work just by exploiting
binding by retinotopy? Does activity synchronization provide enough distinct phases to keep
separate the different regions we recognize in a scene?]

The role of fragment detectors may be fulfilled by those neurons in the IT cortex that respond
selectively to some particular views of an object or to a specific shape irrespective of view
(Logothetis and Sheinberg, 1996; Rolls, 1996; Tanaka, 1996). This very kind of shape-selective
response may also constitute the neural basis of binding by retinotopy, in which the visual field
itself can serve as the frame encoding the relative positions of object fragments, simply because
each such fragment is already localized within that frame when it is detected (Didday and Arbib,
1975). Cf. the notion of the visual world serving as an "external memory" (O'Regan, 1992) or its
own representation (Edelman, 1998).

Binding by retinotopy is possible if the receptive field of each cell is confined to some
relatively limited portion of the entire visual field. Such response properties have been found in
the IT cortex (Kobatake and Tanaka, 1994), and the term "what+where" response type has been
coined to describe the joint tuning of cells in the prefrontal (PF) cortex (Rao et al., 1997; Rainer
et al., 1998). An imaging study of the representation of structure in IT (Tsunoda et al., 1998)
revealed ensembles of neurons responding to overlapping sets of "moderately complex"
geometrical features, spatially bound into distributed codes for entire objects; see also (Perrett
and Oram, 1998; Tsunoda et al., 1999; Tsunoda and Tanifuji, 2000).

Neuromorphic models of visual structure processing

JIM.3

The JIM.3 model (Hummel, 2000a), which has evolved from JIM (Hummel and Biederman,
1992), combines the classical and the alternative. It is structured as an 8-layer network (Figure 1).

The first three layers extract local features: contours, vertices and axes of symmetry, and
surface properties. Surfaces are represented in terms of five categorical properties: (1) elliptical or
not; (2) possessing parallel, expanding, convex or concave axes of symmetry; (3) possessing
curved or straight major axis; (4) truncated or pointed; (5) planar or curved in 3D. Units coding
these local features group themselves into representations of geons by synchrony of firing. These
representations are then routed by the units of layer 4 to two distinct destinations in layer 5. The
first of these is a population of units coding for geons and spatial relations that are independent
or "disembodied" in the sense that each of them may have originated from any location within
the image. Within this population, the emergence of a representation of the object's structure
requires dynamic binding, which the model stipulates to be carried out under attentional
guidance and to take a relatively long time (a few hundred milliseconds).

The second destination of the outgoing connections of layer 4 is a population of geon units
arranged in the form of a retinotopic map. Here, the relations between the geons are coded
implicitly, by virtue of each representation unit residing in the proper location within the map,
which reflects the location of the corresponding geon in the image. In contrast to the
attention-controlled stream, this one can operate much faster, and is postulated to be able to form a
structural representation in a few tens of milliseconds. This speed and automaticity have a price:
because of the fixed spatial structure imposed by the retinotopic map, the representation this
stream supports is more sensitive to object transformations such as rotation in depth and
reflection (Hummel, 2000a).



Figure 1:

The architecture of the JIM.3 model (Hummel, 2000a). The model was trained on a single
view (actually, a line drawing) of each of 20 objects: hammer, scissors, etc., as well as some
"nonsense" objects. It was then tested on translated, scaled, reflected and rotated (in the image
plane) versions of the same images. The model exhibited a pattern of results consistent with a
range of psychophysical data obtained from human subjects (Stankiewicz et al., 1998; Hummel,
2000a). Specifically, the categorization performance was invariant with respect to translation and
scaling, and was reduced by rotation. Moreover, due to the dual nature of the binding process in
JIM.3 (dynamic and static/retinotopic), the model behaved differently given attended and
unattended objects: reflected images primed each other in the former, but not in the latter case.
(Figure courtesy of J. Hummel.)

Chorus of Fragments or CoF

This model exemplifies the coarse-coded fragment-based approach to the representation of
structure (Edelman and Intrator, 2000a; Edelman and Intrator, 2001). It simulates cells with
what+where receptive fields (above) to represent object fragments, and uses attentional gain
fields, such as those found in area V4 (Connor et al., 1997), to decouple the representation of
object structure from its location in the visual field.

#Explain gain fields.#
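
One reading of the gain-field idea, offered here as an assumption rather than as the CoF
implementation, is that the input to a what+where unit is weighted multiplicatively by a smooth
(here Gaussian) function of position relative to the current focus of attention. The sketch
below uses hypothetical names throughout: the feature map, the "above" and "below" offsets and
the width sigma are illustrative choices.

```python
from math import exp

ABOVE = (0, 1)   # gain field centred one step above the attended point (assumed)
BELOW = (0, -1)  # gain field centred one step below the attended point (assumed)


def gaussian_gain(position, centre, sigma=1.0):
    """Multiplicative gain applied to input arriving from `position`."""
    dx, dy = position[0] - centre[0], position[1] - centre[1]
    return exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def what_where_response(feature_map, preferred, attention, offset, sigma=1.0):
    """Response of a unit tuned to the shape `preferred` ("what"), with its
    input gated by a gain field centred at attention + offset ("where").

    feature_map -- {(x, y): shape label} for the fragments present in the image
    """
    centre = (attention[0] + offset[0], attention[1] + offset[1])
    return sum(gaussian_gain(pos, centre, sigma)
               for pos, label in feature_map.items() if label == preferred)


if __name__ == "__main__":
    four_over_five = {(0, 1): "4", (0, -1): "5"}
    moved = {(7, 4): "4", (7, 2): "5"}             # same object, new location
    five_over_four = {(0, 1): "5", (0, -1): "4"}   # same parts, swapped
    print(what_where_response(four_over_five, "4", (0, 0), ABOVE))
    print(what_where_response(moved, "4", (7, 3), ABOVE))
    print(what_where_response(five_over_four, "4", (0, 0), ABOVE))
```

Because the gain field moves with attention, the "4 above" unit responds identically wherever the
object is, yet responds weakly when the same shape appears in the wrong relative position. In
this sense the structure is decoupled from absolute location without being lost.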

Unlike JIM.3, the CoF system operates directly on gray-level images, pre-processed by a front
end that simulates the primary visual cortex (Heeger et al., 1996), with complex-cell responses
modified to use the MAX operation suggested in (Riesenhuber and Poggio, 1999). The system
illustrated in Figure 2 contains two what+where units, one (labeled "above center") responsible
for the top fragment of the object (as extracted by an appropriately configured Gaussian gain
field), and the other (labeled "below center") responsible for the bottom fragment. The units are
trained jointly for three-way discrimination, for translation tolerance, and for autoassociation.
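
The effect of using MAX rather than summation when pooling simple-cell responses into a
complex-cell response can be seen in one dimension. The sketch below is a toy illustration with
assumed signals and an assumed three-sample template; it is not the front end used by the CoF
system.

```python
def simple_cell_responses(signal, template):
    """Template match (dot product) at every position of a 1-D signal."""
    n = len(template)
    return [sum(t * s for t, s in zip(template, signal[i:i + n]))
            for i in range(len(signal) - n + 1)]


def complex_cell_max(signal, template):
    """Complex-cell response as the MAX over its afferent simple cells."""
    return max(simple_cell_responses(signal, template))


def complex_cell_sum(signal, template):
    """Alternative pooling by summation, shown for comparison."""
    return sum(simple_cell_responses(signal, template))


if __name__ == "__main__":
    template = [1, 2, 1]
    one_bar = [0, 1, 2, 1, 0, 0, 0]
    shifted = [0, 0, 0, 1, 2, 1, 0]
    two_bars = [1, 2, 1, 0, 1, 2, 1]
    for name, signal in [("one bar", one_bar), ("shifted", shifted), ("two bars", two_bars)]:
        print(name, complex_cell_max(signal, template), complex_cell_sum(signal, template))
```

Both pooling rules tolerate the shift, but only the MAX response stays the same when a second
bar is added; the summed response grows with the amount of matching material in the field.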

[MAA: cf. Fukushima's Neocognitron (see article in 1e).]



Figure 2:

The CoF model, trained on three composite objects (numerals 4 over 5, 3 over 6, and 2
over 7). The model consists of two what+where units, responsible for the top and the bottom
fragments of the stimulus, respectively. Gain fields (boxes labeled "below center" and "above
center") steer each input fragment to the appropriate unit. The learning mechanism (R/C, for
Reconstruction and Classification) can be implemented either as a multilayer perceptron, or as a
radial basis function network. The reconstruction error ( ) modulates the classification outputs
and helps the system learn binding (a co-activation pattern over units of the preceding stage will
have a small reconstruction error only if both its what and where aspects are correct).

Figure 3 shows the performance of a CoF system charged with learning to reuse fragments of
the members of the training set (three bipartite objects composed of numeral shapes) in
interpreting novel composite objects. The gain field mechanism allowed it to respond largely
systematically to the learned fragments shown in novel locations, both absolute and relative. The
CoF model offers a unified framework, which can be linked to the Minimum Description Length
principle (Edelman and Intrator, 2001), for the understanding of the functional significance of
what+where receptive fields and of attentional gain modulation. It extends the previous use of
gain fields in the modeling of translation invariance (Salinas and Abbott, 1997), and highlights a
parallel between what+where cells and probabilistic fragment-based approaches to structure
representation in computer vision, such as that of (Burl et al., 1998). The representational
framework it embodies is both productive and effectively systematic. It is capable, as a matter of
principle, of recognizing as such objects that are related through a rearrangement of
"middle-scale" parts, without being taught those parts individually, and without the need for dynamic
binding. Further testing is needed to determine whether or not the CoF model can be scaled up to
learn larger collections of objects, and to represent finer structure, under realistic transformations
such as rotation in depth.


Figure 3:

The response of the CoF model to a familiar composite object at a novel location (test I), and novel
compositions of fragments of familiar objects (tests II and III). In the test scenario, each unit (above
and below) must be fed each of the two input fragments (above and below), hence the 12 bars in the
plots of the model's output.

[MAA: The example is rather artificial. Can you give some sense as to how the method might
extend to, e.g., recognition of a human form where the drape of clothing can greatly change the
shape of any component or the appearance where they come together?]

Conclusions

Neither recognition nor categorization requires a prior derivation of a classical structural
description. Moreover, making structure explicit may not be a good idea, either from a
philosophical viewpoint or from a practical one. On the philosophical level, it embodies a
gratuitous ontological commitment (Schyns and Murphy, 1994) to the existence of object parts; on
the practical level, reliable detection of such parts has proved to be an elusive goal. Moreover, a
system can be productive and systematic without relying on representations that are
compositional in the classical sense (Chrisley, 1998; Edelman and Intrator, 2000b). Structure can
be represented by a coarse code based on image fragments, bound together by retinotopy. This
notion is supported by the success of computer vision methods (such as "local photometry, global
geometry"), by data from neurophysiological studies in primates (such as the discovery of
what+where cells), as well as by psychological findings and by meta-theoretical considerations
not mentioned here (Edelman and Intrator, 2000b).

[MAA: Complementary issue: Having recognized the object, how do we extract relevant
parameters, e.g., the pose of a hand?]

References

Biederman, I. (1987). Recognition by components: a theory of human image understanding.
Psychol. Review, 94:115-147.

Biederman, I. (2000). Recognizing depth-rotated objects: A review of recent research and theory.
Spatial Vision, -:-. in press.

Bienenstock, E. (1996). Composition. In Aertsen, A. and Braitenberg, V., editors, Brain Theory:
Biological Basis and Computational Theory of Vision, pages 269-300. Elsevier.

Bienenstock, E. and Geman, S. (1995). Compositionality in neural systems. In Arbib, M. A., editor,
The handbook of brain theory and neural networks, pages 223-226. MIT Press.

Bienenstock, E., Geman, S., and Potter, D. (1997). Compositionality, MDL priors, and object
recognition. In Mozer, M. C., Jordan, M. I., and Petsche, T., editors, Neural Information
Processing Systems, volume 9. MIT Press.

Binford, T., 1971, Visual Perception by a Computer, IEEE Conference on Systems and Controls,
Miami, Florida.

Burl, M. C., Weber, M., and Perona, P. (1998). A probabilistic approach to object recognition using
local photometry and global geometry. In Proc. 4th Europ. Conf. Comput. Vision, H.
Burkhardt and B. Neumann (Eds.), LNCS-Series Vol. 1406-1407, Springer-Verlag, pages 628-

Camps, O. I., Huang, C.-Y., and Kanungo, T. (1998). Hierarchical organization of
appearance-based parts and relations for object recognition. In Proc. ICCV, pages 685-691. IEEE.

Chrisley, R. (1998). Non-compositional representation in connectionist networks. In Niklasson, L.,
Boden, M., and Ziemke, T., editors, ICANN 98: Proceedings of the 8th International
Conference on Artificial Neural Networks, pages 393-398, Berlin. Springer.

Connor, C. E., Preddie, D. C., Gallant, J. L., and Van Essen, D. C. (1997). Spatial attention effects in
macaque area V4. J. of Neuroscience, 17:3201-3214.

Desimone R. Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive
Neuroscience, 3:1-8, 1991.

Desimone R., E.K. Miller, L. Chelazzi, and A. Lueschow. Multiple memory systems in visual
cortex. In M.S. Gazzaniga, editor, Cognitive Neurosciences, chapter 30, pages 475-486. MIT Press:
New York, 1995.

Dickinson, S., Bergevin, R., Biederman, I., Eklundh, J., Munck-Fairwood, R., Jain, A., and
Pentland, A. (1997). Panel report: The potential of geons for generic 3-d object recognition.
Image and Vision Computing, 15:277-292.

Dickinson, S. J., Pentland, A. P., and Rosenfeld, A. (1992). 3-D shape recovery using distributed
aspect matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:174-98.

Didday, R.L., and Arbib, M. A., 1975, Eye movements and visual perception: 'two visual systems'
model, Int. J. Man-Machine Studies, 7: 547-569.

Du, L. and Munck-Fairwood, R. (1995). Geon recognition through robust feature grouping. In
Scandinavian Conference on Image Analysis, pages 715-722.

Duvdevani-Bar, S. and Edelman, S. (1999). Visual recognition and categorization on the basis of
similarities to multiple class prototypes. International Journal of Computer Vision, 33:201-

Edelman, S. (1994). Biological constraints and the representation of structure in vision and
language. Psycoloquy, 5(57). FTP host: ftp.princeton.edu; FTP directory:
/pub/harnad/Psycoloquy/1994.volume.5/; file name: psyc.94.5.57.language-network.3.edelman.

Edelman, S. (1998). Representation is representation of similarity.
Behavioral and Brain Sciences, 21:449-498.

Edelman, S. and Intrator, N. (2000a). (Coarse Coding of Shape Fragments) + (Retinotopy) ≈
Representation of Structure. Spatial Vision, -:-. in press.

Edelman, S. and Intrator, N. (2000b). A framework for object representation that is shallowly
structural, recursively compositional, and effectively systematic. -, -:-. in preparation.

Edelman, S. and Intrator, N. (2001). A productive, systematic framework for the representation of
visual structure. In Leen, T., editor, NIPS (Advances in Neural Information Processing
Systems), volume 13, Cambridge, MA. MIT Press.

Fabre-Thorpe M, Delorme A, Marlot C, Thorpe S. 2001. A limit to the speed of processing in
ultra-rapid visual categorization of novel natural scenes. J Cogn Neurosci 13: 171-80.

Fabre-Thorpe M, Richard G, Thorpe SJ. 1998. Rapid Categorization of Natural Images by Rhesus
Monkeys. Neuroreport 9: 303-8

Farah M.J. Visual Agnosia: Disorders of Object Recognition and What They Can Tell Us About Normal
Vision. Cambridge, Massachusetts: MIT Press, 1990.

Fodor, J. A. (1998). Concepts: where cognitive science went wrong. Clarendon Press, Oxford.

Fodor, J. and McLaughlin, B. (1990). Connectionism and the problem of systematicity: Why
Smolensky's solution doesn't work. Cognition, 35:183-204.

Gray, C. M., König, P., Engel, A. K., and Singer, W. (1989). Oscillatory responses in cat visual
cortex exhibit inter-columnar synchronization which reflects global stimulus properties.
Nature, 338:334-337.

Hadley, R. F. (1997). Cognition, systematicity, and nomic necessity. Mind
and Language, 12:137-53.

Heeger, D. J., Simoncelli, E. P., and Movshon, J. A. (1996). Computational models of cortical
visual processing. Proceedings of the National Academy of Science, 93:623-627.

Hinton G.E., J.L. McClelland, and D.E. Rumelhart. Distributed representations. In D.E. Rumelhart
and J.L. McClelland, editors, Parallel Distributed Processing, volume 1: Foundations, chapter 3.
Cambridge, Massachusetts: MIT Press, 1986.

Hummel, J. E. (2000a). Complementary solutions to the binding problem in vision. -, pages -.
submitted.

Hummel, J. E. (2000b). Where view-based theories of human object recognition break down: the
role of structure in human shape perception. In Dietrich, E. and Markman, A., editors,
Cognitive Dynamics: conceptual change in humans and machines, chapter 7. Erlbaum,
Hillsdale, NJ.

Hummel, J. E. and Biederman, I. (1992). Dynamic binding in a neural network for shape
recognition. Psychological Review, 99:480-517.

Intraub H. 1999. Understanding and Remembering Briefly Glimpsed Pictures: Implications for
Visual Scanning and Memory. In Fleeting memories: cognition of brief visual stimuli, ed. V
Coltheart. Cambridge, Mass.: MIT Press

Kawano K. 1999. Ocular tracking: behavior and neurophysiology. Curr Opin Neurobiol 9: 467-73.

Keysers C, Xiao DK, Foldiak P, Perrett DI. 2001. The speed of sight. J Cogn Neurosci 13: 90-101.

Kirschfeld, K. (1995). Neuronal oscillations and synchronized activity in the central nervous
system: functional aspects. Psycoloquy, 6(36). available electronically as
ftp://ftp.princeton.edu/pub/harnad/Psycoloquy/1995.volume.6/psyc.95.6.36.brain-rhythms.11.kirschfeld.

Kobatake, E. and Tanaka, K. (1994). Neuronal selectivities to complex object features in the
ventral visual pathway of the macaque cerebral cortex. J. Neurophysiol., 71:856-867.

Kobatake E., K. Tanaka, and G. Wang. Effects of shape discrimination learning on the stimulus
selectivity of inferotemporal cells in adult monkeys. Journal of Neurophysiology, 80:324-330,
1998.

Lamme VA, Super H, Spekreijse H. 1998. Feedforward, horizontal, and feedback processing in
the visual cortex. Curr Opin Neurobiol 8: 529-35

Logothetis N.K. Object vision and visual awareness. Current Opinion in Neurobiology,
8(4):536-544, 1998.

Logothetis, N. K. and Sheinberg, D. L. (1996). Visual object recognition. Annual Review of
Neuroscience, 19:577-621.

Marr, D. (1982). Vision. W. H. Freeman, San Francisco, CA.

Marr, D. and Poggio, T. (1977). From understanding computation to understanding neural
circuitry. Neurosciences Res. Prog. Bull., 15:470-488.

Miles FA. 1997. Visual stabilization of the eyes in primates. Curr Opin Neurobiol 7: 867-71

Miyashita Y. Inferior temporal cortex: Where visual perception meets memory. Annual Review of
Neuroscience, 16:245-263, 1993.

Mouchetant-Rostaing Y, Giard MH, Delpuech C, Echallier JF, Pernier J. 2000. Early signs of visual
categorization for biological and non-biological stimuli in humans. Neuroreport 11: 2521-5.

Murase, H. and Nayar, S. (1995). Visual learning and recognition of 3D objects from appearance.
International Journal of Computer Vision, 14:5-24.

Nelson, R. C. and Selinger, A. (1998). Large-scale tests of a keyed, appearance-based 3-D object
recognition system. Vision Research, 38:2469-2488.

Oram MW, Perrett DI. 1992. Time course of neural responses discriminating different views of
the face and head. J Neurophysiol 68: 70-84

O'Regan, J. K. (1992). Solving the real mysteries of visual perception: The world as an outside
memory. Canadian J. of Psychology, 46:461-488.

Perrett, D. I. and Oram, M. W. (1998). Visual recognition based on temporal cortex cells:
viewer-centred processing of pattern configuration. Z. Naturforsch. [C], 53:518-541.

Potter MC. 1999. Understanding Sentences and Scenes: The Role of Conceptual Short-Term
Memory. In Fleeting memories: cognition of brief visual stimuli, ed. V Coltheart. Cambridge,
Mass.: MIT Press

Rainer, G., Asaad, W., and Miller, E. K. (1998). Memory fields of neurons in the primate
prefrontal cortex. Proceedings of the National Academy of Science, 95:15008-15013.

Rao, S. C., Rainer, G., and Miller, E. K. (1997). Integration of what and where in the primate
prefrontal cortex. Science, 276:821-824.

Rensink R.A. The dynamic representation of scenes. Visual Cognition, 7:17-42, 2000.

Riesenhuber, M. and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature
Neuroscience, 2:1019
-
1025.

Ri
ngach DL, Hawken MJ, Shapley R. 1997. Dynamics of orientation tuning in macaque primary
visual cortex.
Nature

387: 281
-
4.

Rolls, E. T. (1996). Visual processing in the temporal lobe for invariant object recognition. In
Torre, V. and Conti, T., editors, Neu
robiology, pages 325
-
353. Plenum Press, New York.

Sali, E. and Ullman, S. (1999). Detecting object classes by the detection of overlapping 2
-
D
fragments. In Proc. 10th British Machine Vision Conference, volume 1, pages 203
-
213.

Salinas, E. and Abbott, L. F
. (1997). Invariant visual responses from attentional gain fields. J. of
Neurophysiology, 77:3267
-
3272.

Schyns, P. G. and Murphy, G. L. (1994). The ontogeny of part representation in object concepts. In
Medin, D., editor, The Psychology of Learning and Mot
ivation, volume 31, pages 305
-
354.
Academic Press, San Diego, CA.

Singer, W. and Gray, C. M. (1995). Visual feature integration and the temporal correlation
hypothesis. Annual Review of Neuroscience, 18:555-586.

Stankiewicz, B., Hummel, J. E., and Cooper, E. E. (1998). The role of attention in priming for
left-right reflections of object images: evidence for a dual representation of object shape. Journal of
Experimental Psychology: Human Perception and Performance, 24:732-744.

Sugase Y, Yamane S, Ueno S, Kawano K. 1999. Global and fine information coded by single
neurons in the temporal visual cortex. Nature 400: 869-73

Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience,
19:109-139.

Tanaka, K., Saito, H., Fukada, Y., and Moriya, M. (1991). Coding visual images of objects in the
inferotemporal cortex of the macaque monkey. J. Neurophysiol., 66:170-189.

Thorpe S, Fize D, Marlot C. 1996. Speed of processing in the human visual system.
Nature 381: 520-2

Thorpe SJ, Imbert M. 1989. Biological constraints on connectionist models. In Connectionism in
Perspective, ed. R Pfeifer, Z Schreter, F Fogelman-Soulié, L Steels, pp. 63-92. Amsterdam:
Elsevier

Tsunoda, K. and Tanifuji, M. (2000). Direct evidence for feature-based representation of visually
presented objects in monkey inferotemporal cortex revealed by optical imaging. Submitted.

Tsunoda, K., Fukuda, M., and Tanifuji, M. (1999). Feature-based representation of objects in
Macaque area TE revealed by intrinsic signal imaging. In Soc. Neurosci. Abstr., volume 25,
page 918.

Tsunoda, K., Nishizaki, M., Rajagopalan, U., and Tanifuji, M. (1998). Optical imaging of
functional structure evoked by complex and simplified objects in Macaca area TE. Society for
Neuroscience Abstracts, 24:897.

Ullman, S. (1979). The interpretation of visual motion. MIT Press, Cambridge, MA.

Ullman, S. (1989). Aligning pictorial descriptions: an approach to object recognition. Cognition,
32:193-254.

Ungerleider L.G. and J.V. Haxby. 'What' and 'where' in the human brain. Current Opinion in
Neurobiology, 4:157-165, 1994.

VanRullen R, Thorpe SJ. 2001. The time course of visual processing: from early perception to
decision-making. J Cogn Neurosci 13: 454-61.

Vogels, R., Biederman, I., Bar, M., and Lorincz, A. (2000). Inferior temporal neurons show greater
sensitivity to nonaccidental than metric differences. Journal of Cognitive Neuroscience. In
press.

von der Malsburg, C. (1999). The what and why of binding: The modeler's perspective. Neuron,
24:95-104.

Wallis G. Spatio-temporal influences at the neural level of object recognition. Network:
Computation in Neural Systems, 9(2):265-278, 1998.

Wallis G. Temporal association in a feed-forward framework. Network: Computation in Neural
Systems, 10(3):281-284, 1999.

Wallis G. and E. T. Rolls. A model of invariant object recognition in the visual system.
Progress in Neurobiology, 51:167-194, 1997.

Wallis G. and H.H. Bülthoff. Learning to recognize objects. Trends in Cognitive Sciences, 3:22-31,
1999.

Young M.P. Objective analysis of the topological organization of the primate cortical visual
system. Nature, 358:152-155, 1992.

Young, M., Tanaka, K., and Yamane, S. (1992). On oscillating neuronal responses in the visual
cortex of the monkey. J. of Neurophysiology, 67:1464-1474.