Gaze modulated visual search integrating spatial and feature data; embodied visual memory for robotic systems II



Martin Hülse*, Sebastian McBride, Mark Lee
Dept. of Computer Science, Aberystwyth University, SY23 3DB, Wales, UK
Email: msh@aber.ac.uk

Abstract

Substantial evidence supports the role of the lateral intraparietal region (LIP) of the brain as the central processing point where bottom-up visual information is modulated by top-down task information from higher cortical structures. It also contains a global egocentric, as opposed to a local retinotopic, mapping and is thus also considered critical for the accumulation of a coherent view of the surrounding environment in the context of an ever-changing visual scene.

We have developed an active vision system architecture based on the LIP structure as its central element. This architecture, as an extension of that previously presented (Hülse et al., 2009), now considers feature data and has the ability to modulate visual search according to specific object properties. The architecture is discussed in terms of its ability to generate visual search for active robotic vision systems.



1. Introduction


Saccades allow objects or events of high attentional priority to be processed at high resolution whilst keeping low attentional visual data within the periphery at low resolution. With this strategy of computational efficiency, however, comes the problem of an ever-changing visual scene and the potential inability to accumulate a coherent view of the surrounding environment (Burr and Morrone, 2010). One of the key requirements to attaining a stable representation of the world, and thereby a meaningful balance between targeted search for specific objects and sensitivity to new stimuli, is the ability to store gathered visual information from saccades within an egocentric framework. Within the brain, the lateral intra-parietal area (LIP) is the primary candidate region responsible for this particular attribute of the active vision system. Evidence suggests that the LIP is responsible for both long-term and short-term inhibition of return (IOR) (not saccading to a previously fixated object) (Vivas et al., 2008) and, given its connections with both the dorsal and ventral visual streams, it is also considered to hold both spatial and saliency information about objects respectively (Singh-Curry and Husain, 2009), in what is often referred to as an egocentric saliency or priority map (Gottlieb et al., 2009). The LIP region, therefore, appears to be the central processing point where bottom-up visual information is modulated by top-down task information coming from higher cortical structures (Platt and Glimcher, 1999). Thus, it appears to be a region critical for both attentional processes and the starting point for goal-directed behaviour.


Based on these biological findings, we have developed an active vision system architecture based on the bottom-up-top-down approach that incorporates an equivalent LIP structure as its central element, referred to here as the visual memory in gaze space. We show that low resolution spatial information as well as high resolution feature data can be combined within this map to allow the modulation of visual search according to both spatial and feature information. This information can be provided locally, through the current visual input, or globally by the visual memory. This architecture is discussed in terms of its ability to generate visual search for active robotic vision systems.



2. Computational Architecture


The architecture is presented in Fig. 1. It operates in two computational domains: the retinotopic space and the gaze space. The retinotopic space represents the visual data provided by the camera, where the visual input from the human retina is simulated by splitting the current RGB image input into two channels. One provides low resolution image data of the camera’s complete field of view (box labelled “spatial filter”), whilst the second provides high resolution image data from the centre of the camera image (“feature filter”), similar to the fovea. The feature filter generates values for specific features found in the centre of the image and stores them as a feature vector (<f>). The spatial filter generates a saliency map in a conventional bottom-up manner (Itti and Koch, 1999).
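
As a minimal illustration of this two-channel split (not part of the original system; a Python/NumPy setting is assumed, and the block-averaged intensity map and mean foveal colour below are only placeholders for the actual Itti-Koch saliency computation and feature vector <f>):

    import numpy as np

    def spatial_filter(rgb, block=16):
        """Low-resolution channel: coarse activation map over the camera's full
        field of view; a stand-in for a conventional bottom-up saliency map."""
        gray = rgb.astype(float).mean(axis=2)        # H x W intensity image
        h, w = gray.shape
        h, w = h - h % block, w - w % block          # crop to a multiple of block
        coarse = gray[:h, :w].reshape(h // block, block, w // block, block)
        saliency = coarse.mean(axis=(1, 3))          # one value per block
        return saliency / (saliency.max() + 1e-9)    # normalise to [0, 1]

    def feature_filter(rgb, fovea=32):
        """High-resolution channel: feature vector <f> from the image centre
        (the 'fovea'); here simply the mean RGB colour of that region."""
        h, w, _ = rgb.shape
        cy, cx = h // 2, w // 2
        patch = rgb[cy - fovea:cy + fovea, cx - fovea:cx + fovea]
        return patch.reshape(-1, 3).astype(float).mean(axis=0)
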

Spatial visual stimuli represented in the form of a saliency map are mapped into the gaze space. This is possible through a mapping that generates eye saccade movements.


Figure 1: Computational architecture for visual search


Co-ordinates of stimuli in the retinotopic reference frame (X; Y) can be transformed into absolute motor positions (<p>) using pan and tilt information from the active vision system. These absolute motor positions are the positions the active vision system would achieve if saccades towards these stimuli were executed.
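
The text specifies only the inputs and outputs of this transformation; one simple illustrative version, assuming a linear mapping from pixel offsets to pan/tilt angle offsets (the function name, image size and per-pixel gains are our own assumptions), could look like this:

    def retinotopic_to_gaze(x, y, pan, tilt, img_w=320, img_h=240,
                            deg_per_px_x=0.2, deg_per_px_y=0.2):
        """Map a stimulus at retinotopic pixel coordinates (x, y) onto the
        absolute pan/tilt position <p> the camera head would reach if it
        saccaded to it. (pan, tilt) is the current absolute posture of the
        active vision system."""
        dx = x - img_w / 2.0                   # horizontal offset from image centre
        dy = y - img_h / 2.0                   # vertical offset from image centre
        return (pan + dx * deg_per_px_x,       # absolute pan after the saccade
                tilt + dy * deg_per_px_y)      # absolute tilt after the saccade
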

However, before potential target configurations <p> are fed into an action selection process (WTA) to derive a target for the next eye saccade, this information is modulated by the visual memory.

The visual memory contains a set of absolute motor positions resulting from successful saccades, each linked to a feature vector (<p>; <f>). A saccade is counted as successful if there is high activation in the central area of the spatially filtered image data. When this occurs, the currently computed feature vector <f> is associated with the current absolute position <p> of the active vision system. This pair of spatial and feature data is then stored in the visual memory.
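
A sketch of this visual memory as a plain list of (<p>, <f>) pairs, with the success test and an inhibition-of-return query as described above (the activation threshold and the distance tolerance are assumed values, not taken from the paper):

    import numpy as np

    class VisualMemory:
        """Visual memory in gaze space: absolute motor positions <p> of
        successful saccades, each linked to the feature vector <f> computed
        at that fixation."""

        def __init__(self, ior_radius=2.0):
            self.entries = []              # list of (p, f) pairs
            self.ior_radius = ior_radius   # positions closer than this count as visited

        def store_if_successful(self, saliency, p, f, threshold=0.8):
            """A saccade counts as successful when the centre of the spatially
            filtered image shows high activation; (<p>, <f>) is then stored."""
            h, w = saliency.shape
            if saliency[h // 2, w // 2] > threshold:
                self.entries.append((p, f))

        def is_inhibited(self, p):
            """Inhibition of return: has a nearby position already been fixated?"""
            return any(np.hypot(p[0] - q[0], p[1] - q[1]) < self.ior_radius
                       for q, _ in self.entries)
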

Originally, we introduced an architecture where modulation of visual input came only from previously acquired spatial data (Hülse et al., 2009). This had the primary purpose of providing an inhibition of return mechanism without any consideration of object features. The architecture in this paper, as an extension of Hülse et al. (2009), now considers feature data and has the ability to modulate visual search according to specific object properties, i.e. it integrates a bottom-up-top-down approach similar to that observed in biological systems (Rothkopf et al., 2007).

An additional important feature of this visual memory is that it also contains object data outside the current visual input (“RGB image”). This is referred to as ‘global’ information, as compared to the ‘local’ spatial data originating from the visual input. This feature allows the architecture to select potential target positions <p> from both inside and outside the current visual view.
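
To illustrate how local and global information could feed the same action selection step, the following sketch (continuing the assumed VisualMemory class above; the feature-similarity weighting is our own illustrative choice, not the paper’s method) scores candidates from the current saliency map and from memory, suppresses already-visited positions (IOR), and picks the winner-take-all target for the next saccade:

    import numpy as np

    def select_saccade_target(local_candidates, memory, target_feature=None):
        """Winner-take-all (WTA) selection over candidates from the current
        view ('local') and from the visual memory ('global').
        local_candidates: list of (p, activation) pairs already transformed
        into gaze space; memory: a VisualMemory instance; target_feature:
        optional top-down feature vector <f> describing the searched-for object."""
        scored = []

        # Local bottom-up candidates, with inhibition of return applied.
        for p, activation in local_candidates:
            if not memory.is_inhibited(p):
                scored.append((activation, p))

        # Global candidates from the visual memory, scored by similarity between
        # the stored feature vector and the top-down target feature, so search
        # can be directed to remembered objects outside the current field of view.
        if target_feature is not None:
            for p, f in memory.entries:
                dist = np.linalg.norm(np.asarray(f) - np.asarray(target_feature))
                scored.append((1.0 / (1.0 + dist), p))

        if not scored:
            return None
        return max(scored, key=lambda s: s[0])[1]   # highest-scored position wins

In this sketch, supplying target_feature makes the search task-driven (top-down), while omitting it reduces selection to the purely reactive bottom-up case.
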


3. Conclusions


Gaze space and visual memory provide the computational substrate for a visual search task that can vary between purely reactive (bottom-up) and task-driven (top-down). The core concepts of this architecture are, firstly, the sensorimotor mapping that transforms visual data from the retinotopic reference frame into the gaze space; secondly, the visual memory represented in gaze space, combining high resolution feature data with spatial information; and thirdly, the modulation of the current visual input by the visual memory. Some of these concepts can be found in biological systems, and we therefore argue that this architecture is a potential step towards a biological model for visual search as well as a promising engineering template for active vision systems within humanoid robots.


Acknowledgment

Thanks for support from EC-FP7 projects IM-CLeVeR and ROSSI, and EPSRC grant EP/C516303/1.


References


Burr, D.C., Morrone, M.C. (2010). Vision: Keeping the World Still When the Eyes Move. Current Biology, 20, R442-R444.

Gottlieb, J., Balan, P.F., Oristaglio, J., Schneider, D. (2009). Task specific computations in attentional maps. Vision Research, 49, 1216-1226.

Hülse, M., McBride, S., and Lee, M. (2009). Implementing inhibition of return; embodied visual memory for robotic systems. In 9th International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems.

Itti, L. and Koch, C. (1999). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489-1506.

Platt, M.L., Glimcher, P.W. (1999). Neural correlates of decision variables in parietal cortex. Nature, 400, 233-238.

Rothkopf, C., Ballard, D., and Hayhoe, M.M. (2007). Task and context determine where you look. Journal of Vision, 7, 1-20.

Singh-Curry, V., Husain, M. (2009). The functional role of the inferior parietal lobe in the dorsal and ventral stream dichotomy. Neuropsychologia, 47, 1434-1448.

Vivas, A.B., Humphreys, G.W., Fuentes, L.J. (2008). Object-based inhibition of return in patients with posterior parietal damage. Neuropsychology, 22, 169-176.