>> Zhengyou Zhang: Okay. So it's my great pleasure to introduce Professor John Tsotsos. John is a professor at York University in Canada, and he's one of the very few people who really work across many areas in cognitive vision, computational vision and computer vision. And today he will talk about a topic which I think is a very interesting combination of computation and computer vision. Thank you.


>> John K. Tsotsos: Thank you so much. It's a pleasure to be here. So I'm going to talk about -- basically it's a visual search talk today, and if you want to categorize it, it's in the active vision domain for a mobile robot. So we'll start off with a little bit of motivation as to why we're looking at the problem and why we're looking at it from this particular angle that you'll see. I'll talk a lot about active object search, and I'll give you some background on work we've done in the past and then a lot of detail on how exactly we're solving the problem currently. I'll show you a number of examples and a couple of experiments that give a little bit of an evaluation of the method; all of that will also include comparison to other techniques for doing similar things, and then we'll conclude with basically giving you an update on the overall project, because the active search is only part of an overall project.


So let's begin with the motivation. And the motivation initially was that we wanted to build a wheelchair for the disabled -- an autonomous wheelchair that's visually guided. So in those cases the dream is really to provide mobility for those people who don't have mobility or have some restrictions in terms of their motor systems.


Now, it's interesting that a lot of the things that we're used to were predicted by Gene Roddenberry in Star Trek, but not this. If you remember the episode "The Menagerie" with Captain Pike, he was in a wheelchair but he was pushed around everywhere. He didn't have his own mobility. So it's curious to me that Gene Roddenberry didn't foresee that someday these chairs would be autonomous.


The first of the papers to appear in this area is from 1986, and there's a little picture in the corner of what that wheelchair looked like. Basically it was a self-navigating chair that had a number of sensors on it, and the main goal was mobility -- in a crowded building and so forth. And that's really it.


That still remains the major goal of all of the groups that do this sort of research. But since then there have been a lot of different sorts of wheelchairs, and I've got just some pictures of a number of them. You could just briefly take a look at them. They span many years. And they kind of all look the same, except, you know, down in the corner here Dean Kamen's iBot is a little bit different and has some unique elements. And even some of the more modern ones have, you know, a sexier design. Really all we can say is that even this Suzuki chair looks a lot like what Captain Christopher Pike had in Star Trek, in that he was just sitting in it and it just provides mobility.


Our motivation is a little bit different. And it really started when I was watching television on a Saturday morning in 1991. I was babysitting my son, who was 18 months old at the time, and I was looking at the Canadian version of your PBS network. It was featuring a roboticist, Gary Birch, from the Neil Squire Foundation. And I didn't know him, so I thought, let me watch. And my son was playing happily in front of -- you know, right below the television.


So what he had shown was a system that he had developed for disabled children to be able to play with toys. So imagine the following sort of scenario. There was a little boy in his wheelchair wearing a bicycle helmet. In front of him was a table, and the table had a number of toys on it. Above the table was a gantry robot with an arm and manipulator. The arm's joints were color coded. Beside the child's head was a paddle wheel where the paddles were color coded the same colors as the joints.


So now what the child would do is hit the paddles of the right color with his head, and every time he hit a given color, one of the joints would move one step. So after doing this a lot of times, he managed to move the gantry robot and arm down to grasp a toy -- a single toy.


So I was watching this while my healthy son was playing on the floor below, and, you know, it kind of got to me, and I thought, this is my area, we should be able to do better than this. So that's really when the project started, back then. It was initially easy to fund because people thought, wow, this is great, a lot of social and economic value. And then as we got further along it was difficult to fund because people thought, well, who's going to pay for such a thing? It's really expensive and stuff like that. So the funding part has always been really difficult. And that's why it's taken a long time to make a lot of progress.


So our initial focus was not on navigation; it was really on how to ease the task of instructing the robot and having the robot deal with a search-and-grasp kind of task. In other words, to short-circuit the tedium that the child I saw in that wheelchair has to face in order to grasp something. If we could only find an easier way to instruct the robot, and then have the robot take over and visually find everything and do all the planning and so forth, it would be much easier.


So our focus is very different from most groups, which focus only on navigation. Add in the fact that my lab has traditionally looked at vision only, and we wind up with a purely vision-based robot. So there is no sonar, no laser on it and so forth. And I'll show you that towards the end of the talk.


Obviously this leads us to an active approach, because this robot has to move around. So we're not looking at static images, and we need to be able to control acquisition. If you're faced with the task of how do I find something in a room, like a toy, it's an active approach. So there are lots of reasons why active approaches are valuable, and I just give a big long list of them here, things that I have written in my papers in the past, hopefully to convince you that there is really a lot of room for active vision in the world, different from the kinds of, you know, single-image static approaches that one sees most commonly in computer vision.


The work we've done in the past in active vision is pretty interesting, in my mind, and it started with a PhD student of mine named David Wilkes. I'll show you an example of his work. It then moved to some work that I did when Sven Dickinson was my post-doc at the University of Toronto, with Henrik Christensen, who was at KTH and is now at Georgia Tech, and then with another PhD student, Yiming Ye, and a master's student, Sonia Sabena [phonetic], and that will be the focus of today's talk.


But I'll just give you these first couple of examples. So this is from Sven Dickinson's work. Those of you who know what Sven has been doing know it's been focusing on object recognition over the years. He initially looked at having an aspect graph representation of objects, which is shown on the left side. And he thought, when he was working with me: suppose we use the links in the aspect graph to encode information about which directions to move the camera in order to acquire different views of the object.


So in particular, if you have a degenerate view such as this one here of the cube, and you try to match it to an aspect graph representation, and you match a particular face in that aspect graph, looking at the links might tell you in which directions to move the camera to disambiguate a cube from an elongated structure, okay? So here is the coding that you see. I'm not going to go through that detail of the coding; it's not interesting. Suffice it to say that for each of the links there would be some information that would give you the next viewpoint, and sure enough you can then move to the least ambiguous aspect of them all just by looking at the different links and comparing them in order to do the disambiguation. Sure?


>>: I guess [inaudible].


>> John K. Tsotsos: Yes?


>>: You mentioned that, you know, the reason for doing all this is to help people [inaudible].


>> John K. Tsotsos: Yes.


>>: I'm just kind of wondering how you expect [inaudible] to be able to specify what he
or she wants.


>> John K. Tsotsos: Okay, I'll answer that now in case anyone has the same question. So what we have right now on our wheelchair is a small touch pad, a touch-sensitive screen, and we're assuming that the user is able to at least move their finger to touch things. So the kind of patient that I was thinking of is perhaps a child with cerebral palsy who is accustomed to communicating using the Bliss language.


The Bliss language is a symbolic language, and typically, you know, on a table in front of them they have a board with a number of symbols, usually paper or plastic symbols. They have, you know, coarse control over their arm, and they point to these symbols in a sequence in order to communicate. So our touch-sensitive screen is an extension of this Bliss language. It includes the Bliss symbols but also includes pictures of the toys, little video clips of actions, pictures of colors and so forth. So you can touch a sequence of those and basically create a short sentence with five or six touches that gives the instruction to the robot of what you want: pick up block, put down here, or something like that. Does that answer? Okay.


The second bit of work I wanted to show you is what David Wilkes had done, and his project was to try to recognize origami objects out of a jumble of them. So you would take a bunch of these origami objects, sort of drop them on a table, and you want to be able to recognize them. He had created an internal representation of these that was basically wireframe, but where the faces of these objects were specially labeled as faces that you could use to disambiguate one object from another. So if you think of it as a hypothesize-and-test kind of framework, trying to index into a database of objects: you would extract information about a face of an object -- in other words, one of those wings, let's say -- and use that as an index into the space of objects, come up with the number of hits that are present there, and then use other faces that are represented in each object as hypotheses to go test, to help disambiguate amongst that smaller set, and then you continue that loop until you come down to a single one.


So this is a little video of the system doing exactly that. I'll remind you, this is 1994, when this was created, and it's on a TRW platform with a CRS robot arm. The gripper has a camera and light source in it. And the kinds of motions that you see it executing are all controlled by a behavior-based system with three different kinds of motions. It turns out that you have a sufficient number of motions to do this if you allow for motion on the hemisphere where the objects are at the center. So you can go radially into that hemisphere, you can go tangentially, or you can rotate.


And those three motions, if structured in the right way, are sufficient to be able to capture the faces that you need to recognize and to move to other views of objects in order to be able to do that disambiguation. It is also able to recognize when it reaches the limit of its reach, so it can move the platform in order to extend the reach, in order to go around that whole hemisphere over the object. Uh-huh?


>>: [inaudible].


>> John K. Tsotsos: Yes. So it's a word that appears in medicine. Basically it refers to a symptom or a sign which makes it clear that it's a particular disease -- in other words, if you find this, then it's definitely that disease. Okay? So the search here is for viewpoints that tell you for certain that it's a particular object. In other words, you have all possible viewpoints when you just toss the objects down in a jumble -- they could be in any kind of pose -- but you're looking for the camera pose, the camera-object viewpoint, that will allow you to determine that it's a particular object with certainty.


>>: So you're basically looking for a unique view of the object?


>> John K. Tsotsos: It's not necessarily unique. There could be overlaps. But you're looking for a view that would distinguish one object from another in that hypothesize-and-test framework.


>>: [inaudible].


>> John K. Tsotsos: Yes. Yes, it does that. So there's some past work that we've done on which we've built. And I just have to say this -- this is kind of, you know, more political than anything, trying to sell you on the idea of attention or active vision -- by pointing out that an awful lot of the work that one sees in computer vision these days tries to set up circumstances so that one doesn't need attention or active vision. And I think that when you're looking at real-world applications, most of those assumptions are not valid. So things like always having only fixed cameras: in our application, where we have a mobile platform, that's not a valid thing. Taking images out of spatiotemporal context so we don't need to track: again, for us, that's not a valid assumption. And so forth.


So there are a lot of situations where most of the current assumptions that people make are, I think, not valid. So let's move on to the main meat of the talk, which is looking at 3D search. This was initially the PhD thesis of my student Yiming Ye, to whom I dedicate this talk, because he passed away a few years ago. So let me formulate the problem for you first. The problem that we're looking at is how to select a sequence of actions that will maximize the probability that a robot with active sensing will find a given object in a partially unknown 3D environment within a given set of resources. Okay? So that's our definition of the search problem. To formalize it further, what we're looking for is the set of actions, out of the full set of all possible actions, that satisfies two constraints: one being that the total time it takes is below some threshold -- in other words, we're not going to wait forever, we're going to have a limit on the amount of time -- and that maximizes the probability of finding the particular object that we're looking for.
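
Written out, the formulation just described might look something like the following. This is my notation, not the slide's, so treat it as a sketch of the constrained optimization rather than the paper's exact statement.

```latex
% Choose a set of actions F from the full action set A that maximizes the
% probability of detecting the target T, subject to a resource (time) budget K.
\[
  F^{*} \;=\; \arg\max_{F \subseteq \mathcal{A}} \; P(\text{target detected} \mid F)
  \qquad \text{subject to} \qquad \sum_{f \in F} t(f) \;\le\; K,
\]
% where t(f) is the time/cost of executing action f and K is the allotted budget.
```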


Under this kind of definition, which is quite general, I think, the problem is provably NP-hard. Yiming had done those proofs, and you can see them in some of the older papers. So given that the problem is NP-hard, you know you're not going to have an optimal solution, so one is looking for approximate solutions in order to get as close as possible to that optimal solution. So we're going to start looking at some heuristics that will help us in dealing with the exponential behavior of the generic solution.


So the first -- sorry. So there are a number of other variables that are of interest. We have to deal with the fact that the robot has an XY location and an orientation on the plane. We have cameras, and they pan and tilt, and they have a depth of field. The room has a certain size, so this is all we know about whatever space we're searching: the length, width, and height of the room. We only know where the external walls are. And the kinds of observations that we want this robot to make are two only: one is whether there's an obstacle at a given position, and the second is whether the target that I'm looking for is at a given position. So those are the variables.

What we do first is think about how we can partition the room, and partition the space of actions as well, in order to make it reasonable -- so one doesn't have to consider all possible locations of the robot, all possible angles of view and so forth. So we're going to have an occupancy grid representation: we're going to fill this room with little cubes, and in each of those little cubes we'll have a couple of variables represented. One is the probability that the object is centered at that cube, and the other is whether or not that cube is solid in the world. Okay? And we then partition all the possible viewing directions. So rather than having the robot consider every possible angle, degree by degree, we'll partition it into certain viewing angles to make that a smaller number. That means that if I'm the robot and I'm looking around in the world, I've got these three-dimensional wedges that I see out into the world, and that creates the structure that we call the sensed sphere. So if you look on the outside of that sphere, everything is at a particular depth of field and it's this little wedge, so it has all of those images that it can look at. So it reduces the number of those images tremendously.
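
To make that representation concrete, here is a minimal sketch of the kind of data structure being described: an occupancy grid whose cells carry a target probability and a solid/empty flag, plus a coarse discretization of viewing directions. The class name, field names, and the particular angle spacing are mine, not the actual system's.

```python
import numpy as np

class SearchGrid:
    """Occupancy grid for visual search: each cell stores P(target centered here)
    and whether the cell is known to be solid (an obstacle)."""

    def __init__(self, room_size_m, cell_m=0.05):
        dims = tuple(int(s / cell_m) for s in room_size_m)   # (nx, ny, nz)
        n_cells = int(np.prod(dims))
        self.cell_m = cell_m
        self.p_target = np.full(dims, 1.0 / n_cells)  # uniform prior, sums to 1
        self.solid = np.zeros(dims, dtype=bool)       # stereo hits mark these True

# Discretized viewing directions: instead of every pan/tilt angle, a coarse set
# of "wedges" that together cover the sensed sphere around the robot.
PAN_ANGLES  = np.deg2rad(np.arange(0, 360, 30))   # illustrative discretization
TILT_ANGLES = np.deg2rad([0, 30])                 # horizontal and tilted-up views

grid = SearchGrid(room_size_m=(6.0, 5.0, 2.5))    # room dimensions are made up here
```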


Because there's a depth of field -- and I'll explain this a little bit more later on -- it really is an onion skin. So you're not looking at all possible depths, you're looking only at the particular depth of field that the camera uses. Now, why is that relevant? Because for every object that you want to see, it's not the case that the robot will be able to recognize it from any position in the room.


If it's too far away, there aren't enough pixels, and if it's too close, it's too big. So you can't recognize it. So you need to have it within a particular depth of field, which is tuned for every object that you would like to recognize -- sort of, you know, dynamically. So we've restructured the representation of the world, and then we're going to associate with each one of those actions the probability that we've actually found the object. An action depends on the robot's position and the current sensor settings, and for any set of actions the probability of detecting the target is given by this, which is a fairly standard Bayesian sort of framework. The conditional probability of detecting the target, given that its center is at a particular cube, is what we call b, and at this point it is determined entirely empirically. And one of the dimensions of future work is to get that to be learned specifically for each particular kind of object.
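
The slide formula isn't reproduced in the transcript, but in the spirit of what's described it can be written roughly as follows. This is my reconstruction, not the slide, and it assumes the standard simplification that failures of separate sensing actions are treated as independent.

```latex
% p(c): prior probability that the target is centered in cube c.
% b(c, a): empirically determined conditional probability that action a
% (a robot position plus camera pan/tilt and depth of field) detects the
% target given that it is centered at c.
\[
  P(\text{detect} \mid a) \;=\; \sum_{c} p(c)\, b(c, a),
  \qquad
  P(\text{detect} \mid \{a_1,\dots,a_n\}) \;=\; 1 \;-\; \prod_{i=1}^{n}\bigl(1 - P(\text{detect}\mid a_i)\bigr).
\]
```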


So that conditional probability depends on the XY location relative to the location of the cube that we're testing, the action that's chosen, the viewing direction and so forth. And as I said, this is determined experimentally for particular objects, and future work is going to look at how one learns it in the more generic case.


The reason why we didn't focus on that, just like we didn't focus on having fancy path planning for the robot and so forth, is because we wanted to focus on the actual search method: how do you determine where to look? So we wanted to test that part without spending a lot of time on the rest. Once that part is working and we have the default situation, we go back and start adding some of these other components. This is something that I kind of worried about for a while, and about a year ago [inaudible] visited, and I showed him the demos here, and I explained to him that, well, we don't have this and we don't have that because we wanted to focus on the active part. And he thought for a second and said, yeah, you're right, I think I agree with you. A lot of the groups that focus on path planning get stuck at path planning and never leave path planning to do all the harder stuff. So you've done that part first. So I felt kind of better about that. But still, it's a weakness of the overall system.


What we have is a greedy strategy. We divide the selection of these actions into where-to-look-next and where-to-move-next stages. So the actions that we can take are: move the robot in XY along the plane in the room; move the camera -- so where can you move the camera to?; and, third, actually take some measurements. So we first decide where to look next, and then where to move next. And there's no look-ahead in this system. So the where-to-look-next algorithm looks something like this.


In the default case, let's say we're in this room, and the room is divided into all of these subcubes. Just for reference, when we did our experiments the cubes were five centimeters by five centimeters by five centimeters in the examples that you'll see later. Each one is assigned a probability that the object is present there; given that we have no information about the object at the moment nor its location, everything is equally probable, and all of it sums to one.


So first of all, we calculate the total probability of a region in space that we are hypothesizing to look at. If everything is equally probable, then every direction looks the same, so you just choose one. You look in that particular direction -- so you have that wedge, if you remember -- and you can test whether or not the object is present there. If it's not present, you can set those cubes to probability zero and update everything else. So after you do that a few times, you run out of places to look from a particular position. I look here, I can look here, I can look here, and then I run out of potential options for where to look if I can't find the object. And then I would decide that, well, maybe I need to move to some other position. And when you see this working, it's kind of like the way a little kid would do it. Because if you asked a little kid to find something in the room, they sort of walk around and they'll, you know, do this, and walk around some more and do this and so forth -- and that's exactly how the system behaves, also.
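
Here is a minimal sketch of that where-to-look loop, building on the SearchGrid sketch above: pick the viewing wedge covering the most probability mass; if the detector fails, zero those cells and renormalize; stop when no remaining view exceeds a utility threshold. The helper names cells_in_view and run_detector are placeholders for what the real system computes.

```python
def next_view(grid, views, cells_in_view, threshold=1e-3):
    """Pick the viewing direction whose visible cells carry the most probability mass.
    Returns None when no remaining view exceeds the utility threshold (time to move)."""
    best_view, best_mass = None, threshold
    for v in views:
        mass = sum(grid.p_target[c] for c in cells_in_view(v))
        if mass > best_mass:
            best_view, best_mass = v, mass
    return best_view

def look_until_exhausted(grid, views, cells_in_view, run_detector):
    """Greedy where-to-look stage from a fixed robot position (no look-ahead)."""
    while (v := next_view(grid, views, cells_in_view)) is not None:
        if run_detector(v):                      # target recognized in this image
            return v
        for c in cells_in_view(v):               # negative result: eliminate these cells
            grid.p_target[c] = 0.0
        grid.p_target /= grid.p_target.sum()     # renormalize so probabilities sum to 1
    return None                                  # nothing left worth looking at from here
```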


Where to move next is then determined in a different way. So if I'm standing here and I've exhausted all my viewing possibilities from this location, I'm now going to ask myself, well, where would be the next best location for me to go? We don't have any kind of path planning here, so right now what the robot does is just look for free floor space; if you had path planning, you wouldn't have that restriction. So you hypothesize: suppose I move over there. If I'm there, what is the set of viewing angles available to me at that position, and what is the sum of all the probabilities of the cubes that I have access to at that position? I compute that for all of my accessible positions and choose the one where the sum of the total probabilities is largest -- that would be the one where I have the strongest probability of finding my object. And I move there, and then I go through the whole process again: I look at all the viewing angles from that position and then repeat until I find the object. Yes?


>>: Actually [inaudible] for example you know, you see something close to you and so I
go make a conscious effort to move sideways [inaudible].


>> John K. Tsotsos: No. At the moment there is no path planning of that sort, nor is there viewpoint planning of that sort. The assumption made in the examples that I'll show you is that there is a free floor position in the room from which I will see one of the views of the object that I know. That's a limitation, clearly. Adding in planning removes that.


But we didn't do that because we wanted to make sure that we were able to handle, you know, this sort of case where planning didn't even exist. So that's kind of the default situation, but you're absolutely right.
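
And a corresponding sketch of the where-to-move step described above: hypothesize each reachable free-floor position, sum the probability mass its candidate views could cover, and move to the best one. Again, reachable_positions and cells_visible_from are stand-ins for what the real system derives from the occupancy grid and the straight-line free-floor constraint.

```python
def next_position(grid, reachable_positions, cells_visible_from, min_mass=1e-3):
    """Greedy choice of the next robot position: the reachable free-floor spot
    whose candidate views cover the largest remaining probability mass."""
    best_pos, best_mass = None, min_mass
    for pos in reachable_positions:              # straight-line free-floor positions only
        mass = sum(grid.p_target[c] for c in cells_visible_from(pos))
        if mass > best_mass:
            best_pos, best_mass = pos, mass
    return best_pos                              # None means nothing exceeds the threshold
```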


>>: [inaudible].


>> John K. Tsotsos: Yeah. You'll see that clearly in the examples. So you can easily imagine that as you move about the room all those probabilities are changing all the time. It actually looks pretty simple initially, but by the end of, you know, two or three positions it's actually pretty complex -- very difficult to characterize in any formal way.

>>: [inaudible]. Just starting, so that's like uniform [inaudible].


>> John K. Tsotsos: Yes.


>>: And then it's evolving as --



>> John K. Tsotsos: Correct. Correct. So we've actually taken this and implemented it three separate times. The first implementation was what Yiming had done for his PhD thesis on the Cybermotion platform. This was the mid-1990s. And the whole algorithm worked well. Different sorts of sensors. And from back then we have no movies; we didn't take pictures, people just didn't do that back then. So I don't have very many examples.


The examples that you'll see running are on the Pioneer 3 platform; Sonia Sabena [phonetic] implemented all of this. And it uses the Point Grey Bumblebee with their standard stereo software, and a pan-tilt unit from Directed Perception. So that's all.


This is the wheelchair. The system has also been ported to the wheelchair -- it runs on the wheelchair as well. It can go find an object. It even finds doors and opens the door and passes through it autonomously, all of that. So this is the touch-sensitive display that I was referring to earlier, with which users can point to give their instructions.


One thing that I didn't say is how exactly we determine whether or not we've found the object. So you're making observations -- I said two observations. One is: is this position solid? That's where stereo comes in. If you get a stereo hit from the Triclops algorithm, then that cell is filled. And the other is whether or not the target that you're looking for is at that position. There are lots of different kinds of object detection algorithms. We defined our own, which actually works pretty well for the kinds of objects we were initially looking at, namely solid colored objects, like blocks, kids' blocks, that sort of thing. So it actually works very well. And it's something that James MacLean did when he was a post-doc in my lab.


And it's based on the selective tuning algorithm of attention that I've been working on forever. It's basically a pyramid tuned to objects, and it uses gradient descent to locate things within a pyramid representation. It works pretty well. It can handle objects as long as they have a distinguishing 2D surface. It seems to be fairly invariant to rotation in the plane and to scale, handles some rotation in depth as well, and uses support vector machines in order to determine some acceptance thresholds. The search algorithm is not dependent on this. As you'll see, in one of the other experiments that we did we used SIFT features to detect an object, because this method doesn't work well for textured objects -- and SIFT didn't work well for solid objects, so we used that. Basically the overall algorithm could have a family of different detection methods and just choose which is the right one to use for the particular object that you're looking for.
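
Since the search loop only needs a yes/no answer per image, one simple way to express that "family of detectors" idea is a per-object registry. This is illustrative only; the object names are invented and the detector bodies are placeholders for the methods described above.

```python
def detect_solid_colored(image, target):
    """Placeholder for the selective-tuning / gradient-descent detector described above."""
    ...

def detect_textured_sift(image, target):
    """Placeholder for a SIFT keypoint-matching detector for textured objects."""
    ...

# The search algorithm is agnostic to the detector, so one can be chosen per
# target object; the mapping and names here are assumptions for illustration.
DETECTOR_FOR_OBJECT = {
    "toy_block": detect_solid_colored,
    "coffee_cup": detect_textured_sift,
}

def run_detector(target, image):
    # Returns True iff the chosen detector recognizes the target with sufficient
    # confidence (e.g., above an SVM acceptance threshold).
    return DETECTOR_FOR_OBJECT[target](image, target)
```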


So this is necessary, but it didn't really define the performance of the method. Also, what we haven't done, and which is future work, is to worry a little bit more about relaxing that assumption I mentioned earlier, namely that there is some spot on the floor from which the robot will find a view of the object that it recognizes. We need to relax that because it's not generally true in the real world. So we need to be able to include viewpoints, image sizes, scale and rotation in 3D, and so on and so forth. One way we can do that a little bit is by simply doing some experimental work and seeing how the recognizer we have functions when there is rotation in depth or occlusion. So these are just a couple of examples of the kind of recognition performance you get as you increase the percentage of the target being occluded, or as you increase the degrees of rotation in depth.


And this is actually not so bad, because you see that for occlusion it's not very tolerant, but for rotation in depth it does a reasonable job up to about 20 degrees. So this gives you at least a little bit of flexibility, but it's certainly not ideal. And there are just some other examples of this. We can actually build this into the system in terms of the tolerance that we're willing to accept -- you know, whether it be here or here -- in order to decide which one of the recognizers to use.


In fact, what we need to do is have a recognizer that will be able to deal with the full viewing sphere around the particular object. We can have a little bit of a buffer zone due to the performance I just showed you, but in general we need to be able to determine that I've actually looked at a particular position from all possible angles that would let me find the object before deciding that I have not found the object -- as opposed to what I told you, namely: I took this view, I ran my recognizer, I didn't find it, I blanked everything out. That assumption also has to be relaxed. This is future work.


We can put in a priori search knowledge, because we can also relax the assumption that everything is equally probable. So there are a number of different things that we can do here. Type one a priori knowledge is exactly that: everything is equally probable. We can include indirect search knowledge. This is an idea from Tom Garvey from 1976. Lambert Wickson [phonetic], when he was doing his PhD with Dana Ballard, had implemented a system that did this indirect search. Basically, if you can find an intermediate object more easily than the target, you find that first and then find the target object. So if you're looking for a pen and that's hard to find -- pens are often on desks -- you find the desk first and then go find the pen.


You could highlight regions in which to try first -- you know, I think I left my keys over there -- so you just put that into the system and have it prefer that position first. You can add saliency knowledge: I'm looking for a red thing, and in this particular image I can't find the object I want but there's a red blob over there, so maybe for the next image I want to go closer and inspect that particular area. Or I could have predictions in there about spatial structure or temporal structure, and these again are older ideas, one from Kelly, one from my own work.
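
As one concrete illustration of replacing the uniform prior, a region hint such as "it's on a table" or "I think I left it over there" can simply reweight the grid before the search starts. The boost factor here is arbitrary and the helper is a sketch, not the actual system's mechanism.

```python
import numpy as np

def seed_prior(grid, hinted_cells, boost=10.0):
    """Bias the initial distribution toward hinted regions (e.g., cells just above
    known table surfaces). The boost value is an arbitrary illustrative choice."""
    p = np.ones(grid.p_target.shape)
    for c in hinted_cells:
        p[c] *= boost
    grid.p_target = p / p.sum()       # still a proper distribution over the room
```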


We can't make simplifying assumptions about the probability distributions, in other words; things can be complex if you have some of these a priori search cues. So let me show you some examples of all of this now, just to make it a little bit more concrete. The simplifications involved in this example are that we start off with a uniform initial probability distribution function -- I'll show you examples afterwards where that is changed -- we limit the tilt on the cameras to 30 degrees, and we have no focal length control for the Bumblebee camera, of course.


We don't have the location probabilities being viewpoint dependent, as I showed you. We're assuming there is a recognizable face of the object visible somewhere from the visible free space, and there's no path planner. So the room in which this is operating is the lab here. The robot will start in this position, facing forward. The target is that object back here. Okay? So basically, if I'm the robot there, the target is behind me.


The total possible number of robot positions is 32, because we're doing this based on one-meter-by-one-meter tiles on the floor. The total possible number of camera directions from any position is 17, with the field of view that's specified there and the depth of field for this particular object being half a meter to two and a half meters. So from wherever the camera is, half a meter to two and a half meters is the region in which our recognizer can recognize this particular object. I'll show you a couple of examples of where that changes for different objects.


The total size of the action set to choose from at any step is 544 actions. So you need to choose from amongst those, and the total number of occupancy grid positions is 451,200, each encoding two values. So if you're going to ask questions like what's the total number of possible states, it's a really big number, and I didn't bother to calculate it. It's not an easy problem to deal with that number of states.

So here's the way the example will work. In this view, this little rectangle is the robot. This is the depth of field and the direction in which it's looking. The zero here corresponds to the fact that the camera is horizontal. These little rectangles are for your use only -- the robot does not know the positions of any of the objects; it only knows the positions of the exterior walls. So these are just for you, to make the scene concrete. And the target is over here.


This image shows you the probability of target presence in the horizontal plane -- in other words, the probabilities of all of the cubes in the horizontal plane only, because it's too hard to show all the planes. And these are the stereo hits, where the stereo algorithm tells you there are obstacles. Black means I've looked there and I've set those probabilities to zero. The gray scale here shows you the resulting probability changes. And this is just an image of where the robot actually is, so you can see it as it's going.


So this is position one, and sensing action one. The second sensing action looks at this other angle, so you see now that the probability changes. With the third you see the background get lighter because the probability gets higher. And the fourth -- so it does all of those, and then the algorithm decides, well, there isn't any other particular view that's going to be useful to me, in the sense that there is a threshold that has to be exceeded in terms of the utility of a particular viewing direction. So then it decides it's time to move.


So as I said, where to move is determined by hypothesizing that I'm at a particular other position and looking at the sum of the probabilities I could see from that position. So this is the map that gives you those hypotheses as each one of the actions I just showed you is executed. This is before anything is done; this is after the first action, after the second action, after the third action, and after the fourth action.


So you see that after all of those views -- the four views that it took -- it has populated the world with a number of stereo hits, some correct, some incorrect. Wherever you see an X, it's determined that the robot cannot get to that point because there's not a straight-line free-floor path from its current position. And the sum of the probabilities of the sensed sphere at each position is indicated in the gray scale.


So after the four viewpoints from here, it determines that the next best position is here. Okay? So it moves there and starts taking its views again -- these will be in blue. It takes a view there, it takes a view there. This view is at a tilt of 30 degrees, because looking backwards it has already seen the horizontal plane, so now it tilts the camera up 30 degrees to take a different view, but over that same area, over here. And then it has to move again. So after all of that process, it's decided that all of this is inaccessible -- all the X parts -- and all of this is obstacles, so that's not useful. So of the free floor positions, that's the only one available to it; that's the strongest one. So it moves there. It looks again out here -- you see this map is now getting a little bit complex, as you were asking, in terms of how it evolves -- and takes a few views, another 30-degree view because it looks back over an area it's already covered. And then it runs out of views and asks the question again: where is my next position?


So here all of these are now dark, because it's already been here and seen all of this, but it hasn't seen things over here. So that's the next best position -- remember, the target is down here. So it goes over to that position, chooses to look in that direction first, and then with the second view it finds the object. So basically it has had four positions and at most four views from each position before finding the object, with that being the resulting map of the probabilities throughout the occupancy grid at the horizontal level only.


>>: So [inaudible] the uncertainty of its current location, because it's moving and it's some sort of dead reckoning.


>> John K. Tsotsos: It's all dead reckoning. It's all dead reckoning. As you'll see in the experiment afterwards, the performance is actually pretty good considering how stupid the planning and the stereo are. This isn't a great stereo algorithm either. So the performance is actually pretty good, and it probably would be perfect if we had a better planner and stereo algorithm. You'll see.


This is an example of what the stereo algorithm gives you. This is just Point Grey's stereo, straight, and there's the target. And this is sufficient to be able to do what I just showed you. So this is a very poor stereo reconstruction by any stretch of the imagination, but it's sufficient to deal with the problem of active search.


Let me show you a different example. This is a smaller object. It's rotated from the pose in which it was learned by the system, and it's partially occluded. So it's back here -- you'll see it a little bit later. The robot starts off initially the same. So it's a little bit more difficult.

And this time we've included here the image that the robot actually sees -- the image that the robot has to search at every view. Everything else is the same. So this time it takes this view first. Notice how the depth of field is different than it was in the previous example. This is because the object is different -- this is computed separately for every object. The object is smaller this time, so it can't be as far away.

It takes its views from that position -- so a wide variety of images that it has to test -- and moves to that position. The object is there, but it's too far away and too occluded, and it's outside the depth of field, so it can't recognize it from that position. It looks through all of this, again at 30 degrees because of the overlap. So again, if you look at these images, it's really like a little kid. I've presented this to people who study developing kids, and they say it looks exactly like what kids do -- they just sort of walk around and look for things.


So it's back there, but the robot's not pointing at it. It's getting closer now, because that's where the object is. It's missing it from every view, and then finally it sees it from a viewpoint here, but at that viewpoint it's too occluded -- it didn't recognize it at that point because the occlusion is too great for it to reach its threshold for recognition. It moves to the next position, and now it finds it, because it has a viewpoint that it actually recognizes sufficiently.


So again, these are four movements, with at most four images at each one of those. So again, that's a pretty good kind of result. When we've done this live for visitors, and we just let the system go until it finds the object -- we allow visitors to put objects wherever they want -- it's a pretty robust system, because it doesn't give up and it doesn't get lost. It really keeps going until it finds the object.


At one point the dead reckoning comes in and it fails, so that part we'll leave aside. Yes?


>>: [inaudible] the object [inaudible] well, what if there's nothing -- like any sort of object [inaudible] -- then it doesn't have information, so what's the behavior [inaudible] in those cases?


>> John K. Tsotsos: You mean in the first shot, in the first image?


>>: [inaudible].


>> John K. Tsotsos: In both of these examples it could not see it in the first image.


>>: [inaudible] why would it be the right decision after that? Does it have any evidence for the [inaudible] about where the next -- so the problem [inaudible] you can't see anything anyway, right? So what makes it the correct step in the first place?


>> John K. Tsotsos: So there's nothing that makes it take the correct step, whatever correct means, for the first several. It's simply exploring and gathering evidence about what exists in the world. Basically, for the first few steps it's deciding where not to look again. Okay? So that's important. The whole thing is really a process of elimination: I've looked here and I can't find it, I've looked here and I can't find it, and I'm just going to keep looking everywhere else until I find something. So it just eliminates as it goes. It's a kind of search pruning process, as opposed to something that directs it to a particular position. Okay?


>>: You mentioned [inaudible] gives up, but if your exclusion map already includes the whole space, then it should give up, right?


>> John K. Tsotsos: If it actually covers everything and -- we'd have to turn off the thresholds, for example, for it to never give up, because right now we limit it: when it decides on a particular next position, it needs to ensure that the total sum of the probabilities in that particular sensed sphere is greater than a threshold, otherwise it won't go there. So it prioritizes. If we turn that off, then all of a sudden all of the probabilities are in play, even if they're small.


So it doesn't give up until all of that is exhausted -- that's what I mean. The examples I showed you take about 15 minutes of real time, not optimized, not on a fancy computer or anything. So I'm sure this would be down to just a couple of minutes if we really chose to push on it.


There are lots of other search robots out there of a variety of kinds -- search and rescue robots and lots of others that keep searching. None of them do exactly this, because a lot of them try to find optimal paths in an unknown environment and search while looking for an optimal path, for example, or they look for shortest paths through environments and that sort of thing.


One interesting comparison is to the work of Sebastian Thrun and his group, who have looked at POMDPs for solving a similar problem. So the natural question you could pose to me is, why aren't we using a POMDP? There are a couple of reasons. First of all, we started this before POMDPs appeared, so we wanted to see it through to the end -- that's not a solid science reason, though. The solid science reason is that the space we have is much larger than what POMDPs can deal with. We can't make assumptions about the nature of the probability distributions within the system in order to simplify things, and the number of states that we have is much larger. I showed you the numbers of actions and the number of positions and so forth. The kinds of things that people have looked at in the POMDP world really involve small numbers of states and observations -- they look at hallway navigation, and that's about the largest number of states that I've seen, which is orders of magnitude smaller than what we need to deal with.


So our problem is just a larger problem. Also, when you think about the solution, it's a very intuitive kind of solution as opposed to what the POMDP does. It's intuitive in the sense that when I've presented this, people have come up to me and, you know, pointed me to the [inaudible] paper that looks at ideal observers in human visual search, and there seems to be a good similarity between the way we've done it and the way experimentalists have found people do it. So we're actually currently investigating whether or not there's a stronger relationship between our method and the way humans do it.


So we have a couple of reasons for not going the POMDP route that I think are pretty solid and interesting. We can examine different search strategies, though. I've shown you only one search strategy, namely: I look first here, I examine everything here, and then I move to another location that has high probabilities. There can be other ones. So we decided to do an experiment looking at four different search strategies. The first one (strategy A) is the one that was present in the examples I just showed you: explore the current position first, and then the next position maximizes detection probability.


We could instead choose an action like this (strategy B), where you choose the single action -- pan, tilt, XY -- with the largest detection probability. So every time we decide the best place to look is over there, I move there and look at one image, then move over there and look at one image, and so forth. So that's a strategy.


Or we could explore the current position first -- this is strategy C -- but the next position maximizes detection probability while minimizing distance to that position. So I'm going to be looking at other places to go, but I want a position where the probability is high and the distance is also small, so that I minimize the amount of travel. Or I could have strategy D, where I've relaxed the distance requirement so it's not so strict -- it's okay to be a little bit farther, it doesn't have to be, you know, the absolute minimum.
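
One way to express the four strategies is as different scoring rules over candidate next positions. The talk doesn't spell out the exact distance weighting (the Q&A later suggests strategy C divides probability by distance), so the formulas below are my assumptions, not the implemented ones.

```python
import math

def score_candidate(prob_mass, distance_m, strategy):
    """Score a candidate next position under strategies A-D (approximations).
    prob_mass: summed detection probability reachable from that position.
    distance_m: travel distance from the current position."""
    if strategy in ("A", "B"):            # probability only (B applies it per single action)
        return prob_mass
    if strategy == "C":                   # maximize probability while minimizing distance
        return prob_mass / max(distance_m, 0.1)
    if strategy == "D":                   # relaxed distance penalty (assumed form)
        return prob_mass / max(math.sqrt(distance_m), 0.1)
    raise ValueError(strategy)
```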


So we ran an experiment using those four strategies, and we also wanted to see the effect of prior knowledge. So what we did is the following. This is our room. We had the robot start in five different positions, and we placed this object, a cup, in four different positions as well. We used SIFT -- if you remember, I mentioned it didn't matter what recognizer we used here; SIFT gave us good recognition performance because of the texture, so we used that. And sometimes we had no prior knowledge -- in other words, exactly like the examples I've shown you -- and other times we allowed it to know that the target object is placed on a table. Not under a table, not over a table, on a table.


And then we had all those different combinations. For each combination we ran 20 experiments -- we ran the whole algorithm 20 times for each combination of all of these conditions -- so the total number of runs was 160. It found the object in 145 of those, and when we looked at the reasons for failure in the other 15, it was dead reckoning for one, because it would get itself lost sometimes, and unreliable stereo, because sometimes you get stereo hits where there actually are no objects, and that blocks the system. So I think that if we addressed those two things -- instead of having really dumb methods for navigation and stereo, had better ones -- we would have very high performance.


And here are the results. For each one of these cases, what we did is measure the number of actions, the total time in minutes, and the total distance traveled in meters. These are the four search strategies that we looked at. The top one has no prior knowledge; in this one, the target is on one of the tables -- it doesn't know which one, but it's on one of the tables. And if you look through this, you'll see that strategy C wins throughout, everywhere -- namely the strategy that says I'm going to choose my next position by maximizing the probability while minimizing the distance to that position.


So in all cases that wins. So the best strategy is that one, and some knowledge is always better than no knowledge in this case. So now you have the ability to decide, for a particular robot task, what is important to minimize: is it time, is it the number of actions, or is it the distance? And you can choose whichever of these strategies minimizes that. In all cases, choosing strategy C plus knowledge minimizes time; for the number of actions you have a number of different choices plus knowledge -- A, C, or D; B is the worst one, so you never choose B; and for minimizing distance it's strategy C with prior knowledge.


So this allows you to tailor your particular strategy depending on what's important to minimize. You're running out of power, you want to minimize time and/or distance -- you program it accordingly, that sort of thing.


>>: [inaudible] minimize the distance and the [inaudible] how will you combine the two?
Is there any principle [inaudible].


>> John K. Tsotsos: That.


>>: Why do you use that?


>> John K. Tsotsos: Oh, why? I'm sorry?


>>: Divide by the distance, but what's the --



>> John K. Tsotsos: It's just -- I think probably the most naive way of putting the two together. I'm sure there are other ways of doing it, but that's kind of the most naive way, and that's what we did for this experiment.


I think there's no claim that those are the only strategies one can use. I'm sure you can come up with other strategies; I'm sure you can formulate them differently. That's what we used and that's the result that we got.


>>: Maybe in something like a resource [inaudible] when you travel long you need a --



>> John K. Tsotsos: Oh, yeah, I think there are lots of ways of dealing with this. So when Yiming Ye was doing this initially, it was for a robot that would be used in a nuclear power plant. There you have a different set of constraints in terms of where you'll find available power for recharging the robot, and, you know, there are a lot of other resources that can come into it. So these strategies would change depending on that.


For our simple scenario, what we've done here is show how you can have different strategies and choose amongst them, how they do have a definite effect on the performance, and how it's possible to then tailor your robot depending on what you want to optimize. That's all. So -- yes?


>>: The numbers you have shown -- are those the average numbers? I wonder how big the variances are [inaudible].


>> John K. Tsotsos: Yes, of course it does. And actually, I don't know. In terms of time, I can tell you that I can't recall it ever going for more than 25 minutes, for anything, ever. And I can't recall it ever finding anything in under five minutes or so. So I don't think the variance in these particular examples would be so large -- I think it would actually be kind of small -- but I don't know the numbers. It's a good question. If Sonia were here, she could tell you.


So this is Sonia's work, and I want to thank her for doing all of it -- it was she who did all the programming on the Pioneer. So let's conclude. Have we made any steps towards the dream -- remember the dream that we initially had? I think we kind of have. We have a visual object search strategy which we've shown to work, which has good performance characteristics, and which has several things we can push on to improve it. I haven't shown them to you, but we have a suite of other supporting visual behaviors for the wheelchair, like opening a door, visual SLAM and so forth, obstacle detection by stereo alone. So all of these things are part of the wheelchair system.


We're currently working on integrating all of those into the system, generalizing the doorway behavior so it recognizes arbitrary doors, and monitoring the user, which is very important. As I was explaining to Zhengyou earlier, we haven't tested this on users just because we can't get ethics approval at this stage -- because we can't prove that it's safe for a disabled user. So we're working on monitoring the user visually, to be able to detect whether or not they might be distressed in order to stop the system, and thus try to achieve ethics approval to do more.


And design is something that still has to come, because the wheelchair robot currently looks like a mess, as most experimental things do. So we need a nice fancy design like Zhengyou has for his table-top system. That needs to be done as well.


But from the perspective of the bulk of this presentation, I think that the visual object search method we have has been shown to be powerful enough to deal with simple situations under pretty difficult conditions -- difficult here meaning that we have no path planning and we have very low quality stereo and so forth. So by improving those, and by adding in the ability to learn objects and have viewpoint generality, I think we could improve this to have far superior performance to anything existing. So if you have any further questions, I'd be happy to answer them.


>>: So the one question I have is you don't assume knowledge of the environment, but you still [inaudible].


>> John K. Tsotsos: Yes.


>>: Which is [inaudible]. I'm wondering whether there's any implication if you merely assume it's [inaudible] in your [inaudible].


>> John K. Tsotsos: Well, the implication would be, first of all, that the reason we use the box is to set the number of cubes in the occupancy grid and to be able to set the initial probabilities. So if you have no limits and you just don't know where the exterior is, I would say that the first thing one would do is arbitrarily pretend I'm in a box, in a fixed area. I'm going to search that fixed area, and I'm going to set the probabilities in this fixed area, and you don't move outside that fixed area until you're certain that the object is not found there, and then you would move to another fixed area. So that would be one way to sort of punt on that problem.


>>: [inaudible] maximize probability, right, [inaudible]. So this assumes that you already have a particular model of probability, right? Since your active [inaudible] is based on the probability, you cannot assume that you already have a good probability model. But how about thinking of an [inaudible] that tries to maximize the peakiness of the distribution? I'm trying to think of those cases where your probability distributions are not good -- I mean, you're probably making incorrect moves, especially in the beginning of the [inaudible] -- so if you replace this by another criterion which tries to figure out where is the maximum information I can get, not just about the object, but in general about the whole probability distribution, like information [inaudible]. I mean, just thinking -- did you have a chance to think about this [inaudible]?


>> John K. Tsotsos: Not in this context, but another one of my PhD students, who is now at [inaudible] as a post-doc, looked at information maximization in visual attention. So he has a model of saliency that does not depend simply on combining all the features into a conspicuity map, as is most common, but rather chooses where to look depending on exactly where you would get the most information by looking. So we've done that on the side of saliency and attention, and that's kind of a precursor to adding it to this system. So from that perspective, I agree with you, it is a reasonable thing to try, and we've looked at some of the first steps there.


>>: [inaudible] but my [inaudible] in the beginning it might give you more [inaudible] but --



>> John K. Tsotsos: It might. On the other hand, it's hard to say, because, you know, it's an environment where you actually know nothing.


>>: It's probably [inaudible].


>> John K. Tsotsos: It's a complex enough problem that you just have to try things in
order to see if it works out.


>>: So compressive sensing has been pretty hot these days. I wonder if comparing with compressive sensing [inaudible].


>> John K. Tsotsos: I'm sorry. With which?


>>: Compressive sensing.


>> John K. Tsotsos: Compressive sensing?


>>: Right. Which [inaudible] guarantees to have a [inaudible] determine your search strategy based on the previous observations. So I'm wondering if you could comment on [inaudible].


>> John K. Tsotsos: I have to admit my ignorance. I don't know what compressive sensing is, so if you tell me quickly, I'll be able to comment better.


>>: Compressive sensing is a bunch of [inaudible] in order to [inaudible] and to not have to [inaudible], and as long as you assume some sparsity on the signal, then you'll be able to reconstruct the signal based on the [inaudible].


>> John K. Tsotsos: And people do recognition on this other signal? Or do you have
to reconstruct the signal and then do recognition?


>>: There's not much recognition.


>> John K. Tsotsos: So for us, the key is not the size of the signal at all; it's doing the recognition and, more importantly, deciding where to look. That's really the question. There's a lot of computer vision work on recognition given an image, and there's lots of work on compression given an image, but here we have to decide which image we want to act on. So the bulk of our work is on deciding which image to do recognition on, not simply doing the recognition.


So from what you just described, I'm afraid I don't have enough to be able to make a better comment. But this is really an exercise in determining how to do signal acquisition -- not transmission, not recognition, not anything else. It's how do you acquire the right signal in the shortest number of steps in order to find something.


>>: And I think that probably relates to compressive sensing. So the signal here is really the probability distribution that you care about, and you go and [inaudible] and then you [inaudible]. Whereas in compressive sensing, you already have a signal and you [inaudible] where to sample it so that you can reconstruct it. So I guess the difference is you don't even know the signal, and that's what you were referring to.


>> John K. Tsotsos: Yeah, you don't know. You don't have that distribution initially. You have to create it by moving around. You just don't have it.


>> Zhengyou Zhang: Okay. Thank you very much.


>> John K. Tsotsos: Thank you.


[applause]