>> Gang Hua: It is my pleasure to have Professor Zhuowen Tu here to give a presentation on his most recent research work on vision and [inaudible] image analysis. Professor Zhuowen Tu is from UCLA, in the department of neurology as well as the department of computer science. Professor Zhuowen Tu has done a lot of research work in [inaudible]. He got a Marr Prize in [inaudible] 2003. And recently he also got -- was it the first prize? -- in a grand challenge on brain image segmentation. And I will leave Professor Zhuowen Tu to give his great talk here.


>> Zhuowen Tu: Okay, thank you so much, Gang. So, yeah. Today I'm going to mostly focus on the recent work I did last year and this year about object recognition in general. And the focus is about context, but it's more about modeling and computing.

So context comes from a lot of so-called within-object parts and between-object parts, in different configurations. And recently context has become a very important concept in computer vision, or even the machine learning field in general. So can anybody guess what this is?


>>: Australian [inaudible] tiny microscopic organisms.


>> Zhuowen Tu: Yeah, yeah. That's a very good guess. So sometimes it's very hard to say -- to see. But if I present you with another thing, you can immediately tell this is a snake, even when only the middle part is included. So that's why context is really important. And then appearance definitely helps you to tell it's an Australian artistic [inaudible]. But what is it? Any guess?


[multiple people speaking at once]

>> Zhuowen Tu: Yes, yeah. Yeah. That's right. So, again, this shows how much context helps us to understand objects in general.


So some related work: recently people have been trying so-called conditional contextual models for object recognition, also boosted -- so-called boosted random fields, and also spatial boosting. The idea is that people are trying to learn pretty much an appearance model, then try to combine the context using a Conditional Markov Random Field type of framework.


So I'll just go to the problem directly. So now suppose we have observation data; the observation X is a huge vector. You can stack all the pixels together as a vector. So now suppose you have a ground truth Y, which is the label for each pixel. So if it's an object segmentation problem, it's 0 for the background, 1 for the label of interest. If you want to annotate all the different objects -- so, for instance, roof or building, trees, things like that in general -- then each pixel has multiple choices, possible different labels, from 1 to K.

So in this field the Bayesian approach has been popular. What we want to do is, given the data X, we want to estimate -- compute, learn a model of the posterior distribution -- and pretty much we can separate the posterior into the likelihood and the prior. That's the typical thing we do.
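In symbols, the decomposition being described is roughly the standard Bayesian one,

    p(Y \mid X) \;\propto\; p(X \mid Y)\, p(Y),

where X is the stacked pixel vector and Y = (y_1, ..., y_n), y_i in {1, ..., K}, are the per-pixel labels.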


But it's a really extremely difficult task, because, first, even now there's no [inaudible] satisfactory likelihood model out there. You can say -- if I give you this kind of roof, can you give me a precise model? It's hard, because everything is joined and it's a texture, all these things. And it's not a homogeneous texture. Also it has the lighting, all these things.


And also the prior. There's a prior about different objects, there's a prior about the shape of each individual object. And if we put them together, it's a huge, complicated thing for us to put all these configurations together and make them into an [inaudible].


So I pretty much summarize the challenges as two parts. One is modeling. How do you really learn a faithful model for the general scene or image understanding problem? Then the other thing is the computing. So even when you have a model -- suppose you can overcome or simplify your model a little bit, suppose you have an approximated posterior or model -- how do you compute an optimal solution? It's still a very difficult task. So those are the major challenges in computer vision and machine learning in general.


And so here we want to put the context into this picture but make the whole computing or learning framework very simple and straightforward, because there has been lots of work in terms of modeling and in terms of computing: Belief Propagation, graph cuts, and many of the recent themes in modeling -- MRF models, CRF models, all these things.


But there are major problems with the existing Markov Random Field models, Belief Propagation, and CRF -- conditional random field -- models.

So all these models pretty much use a fixed topology, a limited number of neighborhoods. It's always pretty much cliques with pair-wise relationships. The energy function is very restrictive.


And they're usually slow; it takes many steps for the message to propagate. Just like seeing a snake head, we can immediately propagate this message to the tail, but for the existing models it's almost hopeless. There's no guarantee, no guarantee they find a globally optimal solution. And, again, the modeling and the computing processes are separate. Even if you have a good model, you have to engage an additional set of computing mechanisms to perform [inaudible] or optimization, which is often very slow for complicated problems.


So now actually I can go directly to this algorithm. So suppose these are some segmentations we obtain. These are the solutions Y and these are the images. So as we can see, there's huge variation in the foreground and there's huge variation in the background. But given all the knowledge we have, we can more or less precisely segment the object of interest.


So now, instead of studying the whole joint distribution -- the whole joint distribution of every pixel's label Y given the image -- let's study the marginals. So [inaudible] what we want to decide is how likely any particular pixel of interest belongs to the foreground or the background. So turn this thing into a marginal probability. But, ideally, if you want to really obtain this marginal probability, oftentimes we need to integrate out all the rest of the pixels.
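Written out, the marginal for a single pixel i requires summing the joint posterior over all the other labels,

    p(y_i \mid X) \;=\; \sum_{y_{-i}} p(y_i, y_{-i} \mid X),

where y_{-i} denotes the labels of all pixels other than i; that sum over exponentially many configurations is the integration he wants to avoid.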


For instance, if we were to say how likely this pixel belongs to the horse body, you need to integrate out all the neighboring pixels because they are in the joint probability space. So studying this marginal seems to be an even more difficult problem, because of the integration. But we can turn things around to make the problem simpler.


So suppose we face -- let's consider a traditional classification problem. I'm giving you an image and the task is to perform foreground/background segmentation. And let's say suppose we have a training set. This S has all the image patches, and for each image patch there's a label. So now each i is a pixel here and we just crop an image patch. And each image patch has a label associated with it. It's very simple.
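In notation (roughly following the auto-context papers), the training set pairs each pixel's label with the patch around it,

    S \;=\; \{\, (y_i,\; X(N_i)) : i = 1, \dots, n \,\},

where N_i is the patch (neighborhood) centered at pixel i, and a classifier is trained to predict y_i from X(N_i) alone.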


So if we do this, we can collect two bags. One is a positive bag, another one is a negative bag. So the positive bag contains all the patches whose center pixel is on the object. So these are the positives. And you can collect all the negatives, the background patches, into another bag. We can perform any kind of classification in this kind of [inaudible]. But it's very hard, because when we humans understand or segment an object, we have all this middle-level, low-level, high-level information in there which allows us to [inaudible] to understand and recognize the object. But if you only look at the image patch, there is only so much information. For instance, this head of the horse is very dominant, but many of the structures are very hard to tell.


So that's why you can put it into a logistic regression formulation, but it won't get very good results, because it's intrinsically difficult to separate them.


So then auto-context is -- it's an extremely simple algorithm. Suppose now we have a training image: we have the input image and you have the ground truth label. You can perform a first level of classification on image patches with your favorite classifier, your favorite features. You obtain a classification map like this. So as I said, for certain structures it's probably dominant, but many times it's fuzzy.


So what we can do here, actually, is that at each pixel we have an image patch, and you can compute the features, the patch appearance. So the question is how do we combine this kind of so-called prior on the configuration of different objects, or even of a single object's different parts. So for any pixel here, you can shoot out many rays, and in your classification scheme you also collect the pixel -- the probabilities on these rays, or you can even put in the gradients of the probability map or classification map itself, and you obtain a second layer of classifier.


So the first layer of classifier is trained on appearance only. Then in the second layer -- once you have a probability map output from the first layer -- you augment your classification feature pool: not only do you have the appearance features, you also have the probability map, the context, from the last time, from your classification map. So definitely the probability map is going to be in your hands.
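A minimal sketch of the training loop just described, assuming hypothetical helpers (patch_feats for the appearance features, context_features for sampling the probability map along offsets); any off-the-shelf per-pixel classifier could stand in for the boosting used here.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    def context_features(prob_map, i, j, offsets):
        # Sample the current probability map at a set of (dy, dx) offsets
        # (the "rays" around the pixel); out-of-image samples fall back to 0.5.
        h, w = prob_map.shape
        vals = []
        for dy, dx in offsets:
            y, x = i + dy, j + dx
            vals.append(prob_map[y, x] if 0 <= y < h and 0 <= x < w else 0.5)
        return vals

    def train_auto_context(images, labels, patch_feats, offsets, n_layers=4):
        """images: list of HxW arrays; labels: list of HxW {0,1} maps;
        patch_feats(img, i, j): appearance features for the patch at (i, j)."""
        classifiers = []
        # Layer 0 starts from an uninformative (uniform) probability map.
        prob_maps = [np.full(img.shape, 0.5) for img in images]
        for _ in range(n_layers):
            X, y = [], []
            for img, lab, pmap in zip(images, labels, prob_maps):
                for i in range(img.shape[0]):
                    for j in range(img.shape[1]):
                        feats = patch_feats(img, i, j) + context_features(pmap, i, j, offsets)
                        X.append(feats)
                        y.append(lab[i, j])
            clf = GradientBoostingClassifier().fit(np.array(X), np.array(y))
            classifiers.append(clf)
            # Recompute the probability maps with the new layer; they become
            # the context features for the next layer.
            new_maps = []
            for img, pmap in zip(images, prob_maps):
                out = np.zeros(img.shape)
                for i in range(img.shape[0]):
                    for j in range(img.shape[1]):
                        f = patch_feats(img, i, j) + context_features(pmap, i, j, offsets)
                        out[i, j] = clf.predict_proba([f])[0, 1]
                new_maps.append(out)
            prob_maps = new_maps
        return classifiers

Testing runs the same sequence of classifiers, starting again from a uniform map, which is the point he makes below about training and testing sharing the same procedure.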


And so the blue dots are all the candidate features; these red ones are the features picked by the classifier. So it's like we're trying to build a dynamic graphical model. In the traditional graphical model, the connections are very limited. But here we allow the algorithm to [inaudible] learn what the supporting features or sites are. These sites can be either near the current pixel of interest or they can be far away. So if something indeed is really dominant or informative, the message can be immediately propagated from other sites.

And also we don't throw away the appearance features. The appearance features are also there; it's just up to the learner to decide whether the appearance is more informative or the context is more informative. Yes.


>>: Do you use a classifier, a classified result from the [inaudible]?


>> Zhuowen Tu: Yes.


>>: [inaudible] classifier [inaudible]?


>> Zhuowen Tu: Yes, yeah. So this is the classification map of the first layer of
classifier. Purely on the appearance. Yeah.


>>: So many times when I think of context, I think of context as using information about the objects around the object you're trying to recognize.


>> Zhuowen Tu: Yes.


>>: Because here you recognize that this is grass, which makes the horse more likely, and that sort of thing. But there's a lot of different definitions of context.


>> Zhuowen Tu: Yes.


>>: Is your definition of context, then, context within an object? So I want to use the context of the horse's legs and the head to then help the --

>> Zhuowen Tu: Yeah, that's a very good question.


>>: -- [inaudible] is that more in line with what you're defining context as?


>> Zhuowen Tu: No, no. So that's a very good question. So context, as I said earlier, it's just like I described: most of the time context comes from between different objects. But one major theme for context is also that for a single object, the context can also come from its different parts. That's the mechanism I'm going to talk about later. This mechanism does not distinguish between within-object and between-object context. It's all up to the learner to decide.

And if we have multiple labels, I'll show that what you have is just multiple classification maps for different labels. And also the algorithm -- the classifier [inaudible] with all these different contexts from different labels. And then it's up to the learner or classifier to select them. So everything's automatic. And so if --


>>: [inaudible] you pick up on the grass.


>> Zhuowen Tu: Yes. Yes. Yeah. I'm going to show you later.


>>: [inaudible] the fixed scale? I noticed [inaudible].


>> Zhuowen Tu: That's a very good question. So I'll -- once we go through, I'll talk about that. But here, at this moment, you can consider it fixed in terms of scale. But later on you can add scale information. Once we get to more or less the later part of the talk, you will see. Yeah. That's a very good question. Yeah.


>>: [inaudible]


>> Zhuowen Tu: Yeah. So here it's very hard to say, because for any pixel it shoots out these rays, right? So there's no kind of within or between, because here there's only one classification map. But if you have multiple classification maps, you will see that if the sky is dominant, then these things will help to support. That's for sure.

But so for multiple classes -- so the local part, the within-class thing, helps you resolve local ambiguity. But for multiclass objects, sometimes if one particular -- I'll show you later -- one particular type of object is really dominating and supporting this decision, then these things are often picked up by the classifier.


So then the next level of classifier will output another classification map. You can see the map has been greatly enhanced. And you just train another layer of classifier. Now the algorithm actually chooses a different set of supporting features based on the current map. And this is pretty much after four iterations. You can see the classification map has been greatly enhanced because of all these supporting things selected by the --



>>: When you say iteration, is that -- it looks like a layered system.


>> Zhuowen Tu: Yes.


>>: It's really layers.


>> Zhuowen Tu: Yes. So you just train -- it's not a constant. It's a series of -- a sequence of classifiers. So an iteration just means another classifier. So in one sense you can consider it as performing message propagation. But here the message propagation acts in a much more efficient way than traditional Belief Propagation, because, first, you don't need to compute the integration explicitly. You just compute the logistic regression, so it's closed form. And then, second, all these supporting things are much more informative, and you can afford to have 1,000 of these supporting contexts, but it still can be computed within, like, milliseconds.


So what the algorithm is doing, actually, is trying to learn or compute a marginal distribution, conditioned on the previous classification or probability map. So what we want to do is, by doing this, we're hoping that we can approximate the whole true marginal distribution, but without doing the explicit integration.
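Roughly, in the spirit of the auto-context formulation, layer t learns a conditional approximation to that marginal,

    p^{(t)}(y_i \mid X) \;\approx\; p\big(y_i \;\big|\; X(N_i),\; P^{(t-1)}\big),

where X(N_i) is the appearance patch at pixel i and P^{(t-1)} is the probability (classification) map produced by the previous layer; iterating layers refines the approximation without explicit integration.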


So for the features, you can compute the appearance features on all these gradients, even, like, the face of the texture -- like [inaudible] features can be used. And also context. You can let the algorithm select a very large amount of context information in this framework, and it's extremely simple.


Earlier I said the first classifier is different from the second one. At this point I can take back that argument. The first classifier [inaudible] is identical to the second one. The only difference is that for the first classifier, the classification map is a uniform distribution. So now you don't have to distinguish the first one from the later ones, because everybody shares the identical procedure and code. The only difference is you start from an initial map.


So -- but this is for this particular application. In some other domains, you may know roughly where the foreground appears. Then you can input that to the system, give the system an initial classification map. Then the algorithm starts to justify and refine all these things. So it's -- again, it's an extremely simple algorithm. You can see the classification maps are being enhanced. If there's a wrong thing, the probabilities are being suppressed. Then if there are these weak things, you can see the probability maps are being enhanced.


>>: Trying to get a picture here. Is the classifier only binary, yes or no, it's part of the horse or not? Or does it sort of know every part [inaudible] where it is on the horse, so it expects some other part to be at another spatial position? Or what is happening here?


>> Zhuowen Tu: Yeah, that's a good question. So there's this explicit versus implicit parts type of thing. Here there's no explicit notion of where the parts are. That's kind of good or bad. The good thing is that in the annotation stage there's no confusion; you just say yes or no. It's up to the learner to select. But the better thing is that this kind of information is implicitly buried. Yes.


>>: So conceptually, if I shuffle the pieces here, the support that a part gives to its neighbors is just that it's present, it's not how it is related to -- I could turn the horse upside down and it would still do the same thing.


>> Zhuowen Tu: Yes.


>>: [inaudible]


>> Zhuowen Tu: Yes. Yeah. Yeah. But so the classifier is trying to decide -- the decision is 0/1, but the features are probabilities. So it's the confidence. It's not a hard decision.


>>: But one plays [inaudible] in the sense that are there other horse parts around --



>> Zhuowen Tu: Yes.


>>: -- that don't care where they are, they're [inaudible].


>> Zhuowen Tu: [inaudible] yes. Yes. Yes. Yes. That's right.


>>: Do you have a different classifier for every pixel in the image?


>> Zhuowen Tu: No.


>>: So it's the same classifier?


>> Zhuowen Tu: Yes.


>>: Okay. And it doesn't depend upon a spatial relation? Like the classifier -- you have some things marked in red and blue. Can you describe what the red and blue mean?


>> Zhuowen Tu: Yeah. The red ones are picked. The blue ones are -- blue and red together are the candidate features or candidate context. The algorithm does not pick them all -- the classifier does not pick them all.


>>: If you're red, it doesn't care which red one you are; it puts all the red ones into one big bucket.

>> Zhuowen Tu: Into one big bucket. Although there are also different weights for them.


>>: Okay.


>> Zhuowen Tu: Because the classifier itself, it's more or less a tree, so actually these
different locations get automatically sent to different branches of the tree.


>>: [inaudible] two kernels here, one of them is the kernel on the patch, the other is the kernel on the context of the weights for [inaudible].


>> Zhuowen Tu: Yeah. But in the end the classifier does not [inaudible]. It just -- because when you feed the classifier with all these features, these features -- the appearance features and the context features -- you just line them up. So you don't separate them in the training.


>>: So how come these two red squares are at the top of the last? I mean, I would have thought that the only support for [inaudible] would be on the very bright areas that you can [inaudible].


>> Zhuowen Tu: But you also have the background support. Because every pixel shares the same classifier, so sometimes you want to suppress the background, sometimes you want to enhance all these things. It's not necessarily just from the object body.


>>: [inaudible] negative weights.


>> Zhuowen Tu: Yes. You get kind of negative weights, yes.


>>: But red could mean a negative weight or a positive weight?


>> Zhuowen Tu: Yeah. Yeah. So -- yeah. Yeah. Because it's not only on the foreground, it's also on the background.


>>: If there's no spatial information involved, I could have understood that if the belly features are in the center, then it would expect background to be high up in the sky, no more horse. But how come it knows that now, if you're saying it doesn't use the spatial information?


>> Zhuowen Tu: So the spatial information is from all [inaudible]. So each pixel here, each dot over here, is one particular feature. So the spatial information is implicitly carried. It's not like, oh, I want to check whether three meters away there's something. It's here. So I check based on my point. If that feature is selected, then I'm going to check the probability of that guy. So the spatial information is still there. It's really important. But it's not explicitly carried as saying, oh, go -- you measure the probability along this ray and put something like that. So it's not explicit.


>>: Are you worried about -- he's using the same classifier for every pixel, and therefore you would want a different classifier for, like, the feet of the horse, say, than the upper back, right? But you use the same classifier.


>> Zhuowen Tu: Use the same classifier.


>>: And it works.


>> Zhuowen Tu: It works. No -- so in the talk I was trying to separate the type of classifier I'm using from the problem itself, from the modeling problem itself. So what I did realistically was I used this so-called [inaudible], but you can use your favorite one. It's like the classifier itself is trying to classify this multimodal kind of class of thing. It's not like an SVM where you have a weighted sum of everything all together. It's pretty much a decision tree. So when you decide all this, it's automatically embedded in the classifier --


>>: [inaudible]


>> Zhuowen Tu: -- yes. Exactly. So all these different locations are actually in different branches [inaudible] the trees. So it's automatically done by the algorithm.


>>: Okay.


>> Zhuowen Tu: Yeah. Yeah.


>>: I see you used the old favorite [inaudible].


>> Zhuowen Tu: Yeah. But you can use others -- yes. Yeah.


>>: So the only invariance that's built into this a priori is just a translation invariance. Any other invariance you have to get from your [inaudible].

>> Zhuowen Tu: Yes, yes, yes. Yes, yeah. And also you can put the position in as a feature as well -- the position of each pixel as a feature. That helps in some particular domains where you know the structures roughly. Yeah.


>>: [inaudible]


>> Zhuowen Tu: For this one I did.


>>: [inaudible]


>> Zhuowen Tu: So these are the results on the test images. So these are from the so-called Weizmann horse dataset. So, first of all, the input image is at step one and these are the final results. As we can see, the results are much better. And also these are almost the worst results on this dataset, on the test images. Even on these we can see this is a really dotted kind of texture horse, but still the results are not that bad. We pretty much get the best results on this. And it's really fast. Usually takes about -- I think for this one it takes about 15 seconds or 10 seconds.


>>: How many [inaudible] in the training set?


>> Zhuowen Tu: Oh, it's 128 images.


>>: [inaudible] picture without a horse, would it [inaudible] a horse or would it have it be all black?


>> Zhuowen Tu: That's a good question. I'm going to show you.


>>: Okay.


>> Zhuowen Tu: Yeah. So these are some comparisons in terms of the asymptotic behavior of the algorithm. So first is to compare. So at the first iteration you start from image appearance only, then this context starts to kick in. You can see the higher the [inaudible], its F value. So this is tested by the auto-context. This is the training error. So we're not doing too much overfitting. The algorithm pretty much asymptotically keeps at the same thing.

And one question about context people have asked is, what if I just add all appearances --



>>: Can you remind me what F value is [inaudible].


>> Gang Hua: The F value just says -- let's say you can consider it an average between precision and recall, detection -- yeah. So the higher the better.


>> Zhuowen Tu: And one thing people want to say is, instead of computing all the probabilities, just enhance or enlarge my appearance feature pool: instead of looking at all the probabilities based on the previous map, I just look at the appearance at all these locations. So you can consider the appearance as context and directly put it into the classifier. But it was not that good. Once you do that, you pretty much end up with overfitting, because it's very easy -- so now if we have 1 million features, it's very easy to fit the classifier, but it's hard to generalize to the test images.


So that said, actually, these are the things I did [inaudible] as appearance features, and the results are significantly worse than using this kind of context type of procedure. So these are the results on the horse dataset. And the curve actually does not tell too much -- so this is our recent result. These are the results by the Berkeley people, which were way above. But actually this is also due to the measure, this F value in terms of foreground and background. The results actually got much enhanced, but on this particular dataset it seems like it was not showing --



>>: Is this precision-recall yes-no it's a horse per image, or per pixel spatial [inaudible]?


>> Zhuowen Tu: That's right. Because even if you put a black box over there, you'll still get likely 70 or 80 percent accuracy there in F value. So that's why in this particular one it's not that illustrative.


And also I did several interesting experiments to test it. First is -- and now, remember, we have the appearance, we have the probability map, and you put them together. So one question is, in the second layer, what if I just learn -- use features from the probability map without appearance? So before, the second classifier still used the appearance. But, interestingly, 90 percent of the features selected by the classifier are mostly from the probability map. The appearance ones become less important.


But now let's completely remove it. So these are the different results. So this is the regular auto-context. This is the one without appearance, starting from the second layer -- the appearance features are no longer there. So it's still worse.


>>: [inaudible]


>> Zhuowen Tu: Yes, yeah.


>>: [inaudible]


>> Zhuowen Tu: Yeah. Yes. So you can see having the appearance layers is still a good thing. That's one message I want to sell here: in the previous studies or papers, people pretty much separate appearance from the prior -- all these things. But here we really want to integrate them together almost seamlessly, because it's really hard to tell when is a good time to throw away the appearance -- for the whole modeling part, we should keep them together if we can.


So another test is, after the first classifier is trained and after the second classifier is trained, instead of training another one, I just repeat the previously trained second one. So these are the results. So if you just repeat, you see the algorithm is still [inaudible], but this one probably drops down. So this means that at least at this level it can push it up. But in the end it drops down. So basically it means that it's probably not generalizing very well.


But if you do this, you actually still can improve the results. So if you cannot afford too much training time, you can just repeat from this start. It's like before, we have a dynamic graphical model for the algorithm to choose from. But here we fix this model once it's trained by the second layer of classifier -- just fix it. So the performance can be pushed up, but it's not ideal. So that goes back to your question. So all these images are not from this particular dataset; I just went to Google, typed in horse images, and got these things.


So these are the results. Particularly interesting is this example. You can see all the horses are facing in this direction, so you can call this either a failure or a success. Failure meaning this small horse is not segmented out. Success, because you can see the algorithm almost carved out the small horse.

And also in the training there is only one horse in all these training images, and here we have multiple. And these are the trees, which are pretty much black. You can see they're not segmented out. If you were to run a traditional segmentation algorithm, you would have all this junk type of thing over here. But this is just an illustration to test how the algorithm is able to generalize, and that it's not really overfitting a particular dataset.


>>: [inaudible]


>> Zhuowen Tu: Yeah.


>>: I mean, can you try like a [inaudible] horse can handle it?


>> Zhuowen Tu: Yes. But that's -- I just wanted to show it. It's an interesting thing to show. It's able to carve it out. That's why I'm saying it's either a success or a failure. But if I were to really build a horse detector, probably, yeah, it should, yeah.


>>: I mean, you could also look at this as what you're learning is a "not grass" detector.



>> Zhuowen Tu: Yeah.


>>: Just to be --



>> Zhuowen Tu: Yes, yes.


>>: So it's kind of hard to tell -- do you have some images which aren't grass? I mean, it's like -- which are a little more different to see, to kind of get an intuition: is it learning what a horse is, or is it learning what grass is not?


>> Zhuowen Tu: Yes, that's a good -- probably these mountains are not grass. But, yeah, that's a good point. Yeah. Pretty much still -- yes. Yeah. Yeah. Yeah. I agree with you.


>>: [inaudible] try some of the [inaudible].


>>: You put a person.


>> Zhuowen Tu: Yeah, I put a person.


[multiple people speaking at once]


>> Zhuowen Tu: Yeah. But all these are very good, yes, suggestions to test.


So the convergence -- I probably will skip the details. The convergence is guaranteed in terms of training, because every time we are guaranteed to [inaudible].


>>: [inaudible]


>> Zhuowen Tu: Yeah. So -- and also now this goes back to David's question. So now, instead of a single object, what if I have parts? So I collected all the data. At that point I was not getting too much help from the graduate students and I did everything by myself, annotating all the parts: head, body, torso, feet, all these -- legs. So there are 14 body parts over here.


So these are -- so I collected, I think, 120 images from Google, and there is a separate dataset from Berkeley for the test images. So these are the results by the first layer, only based on appearance. As you can see, the body is pretty much there but still very fuzzy. The head -- we almost completely lose the head. And the feet or thighs, all these things.


So the second layer starts to pick it up, because once you have rough information about the body, then using the body information the head [inaudible] comes out. And here again, as I said, the algorithm is very general. It does not distinguish between a two-class problem and a multiclass problem; you just stack all the classification maps and it's up to the learner to decide. So now the between-object context becomes important, and it's being picked up. So this is the F -- the fourth layer of classification. We can see the head is pretty robust, the body is there, the feet actually are almost all picked up. But some of the thighs are missed. So this is a more implicit kind of thing. So the next level I'm pushing for is to carry more explicit information as context to better -- but this pretty much gives you a picture of how this thing can be done. It's really fast compared with traditional kinds of body segmentation; it just takes seconds to perform.


>>: And so here [inaudible] one classifier --



>> Zhuowen Tu: That's a good question. So again I'm putting this whole multiclass classifier in as a black box. You can choose your favorite one. But actually, recently we did this so-called data-assisted output code. You don't have to have so many classifiers; you just have a coding strategy. I'm going to talk about it shortly. And then you only need to have log 2 of K classifiers, instead of [inaudible] if you're -- let's say, imagine if we have 1,000 [inaudible]; you can't afford to train 1,000 classifiers.


So this is a more recent kind of 24-class MSRC dataset. They annotate 24 classes of objects in all different backgrounds. So these are the results by the algorithm: sky and person, all these things. The boat is okay, but some of it is confused with road and water. So this one is a little bit confused with sky. But the street scene actually is pretty good. Road, cars, sky -- these are the really good classes, and things like that.


So this is the confusion matrix. At that point we pretty much got the best score on this dataset; [inaudible] at that point it was 72.2. And another thing is our confusion matrix is much better than the existing ones at that point, because we're making reasonable mistakes. So if something was confused, it's confusion between cow and sheep, not like cow versus car, things like that.


So these are some new extensions we did recently. So then -- so this is the so-called new multiclass classifier, the data-assisted output code. It speeds up the training and testing process drastically. Then we also have the scale-space approach, and then also a region-based voting scheme. For that one, for the last -- this year's CVPR report, it takes about 30 to 70 seconds for this multiclass labeling. So then the region-based approach speeds it up drastically.


Okay. So yeah. This is a new multiclass classifier. So now suppose you have 100 classes of labels. If you were to do one versus all, you would have to learn 100 classifiers. So here we did a so-called output code type of thing. So each class gets different bits. Each bit represents a classifier. So for each bit it randomly gives -- yeah. So the existing work pretty much gives a random coding bit to different classes. So if you don't have any redundancy, you don't have any error-correction capability; you only have log 2 of K bits or classifiers. But all the existing work does not look at the data. So now suppose we have a four-class problem; [inaudible] pretty much is a random kind of coding strategy.


So if you do it randomly, you will bind these two classes into one, and this is your -- let's say your foreground and background. Then your classifier will have a much more difficult time than if it had looked at the data. So this new thing looks at the data: when you assign the bits, you look at the data first, and then ultimately your classifiers are much more reasonable.

And it also balances the training and the computing time. If you were to do one versus all, you have to have so many classifiers. But here you only have log of K. So overall it's much better than one versus all and the existing ones, by looking at the data.
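A minimal sketch of the generic output-code mechanism this builds on (not the data-assisted code assignment itself): each class gets a bit string of length roughly log 2 of K, each bit is one binary classifier, and a test example is assigned to the class whose codeword is nearest to the predicted bits.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_output_code(X, y, codes):
        """codes: (K, B) 0/1 array, one B-bit codeword per class; y: class indices.
        Trains one binary classifier per bit (column). Assumes each bit splits
        the classes into two non-empty groups."""
        bit_classifiers = []
        for b in range(codes.shape[1]):
            # Relabel each example by the b-th bit of its class codeword.
            yb = codes[y, b]
            bit_classifiers.append(LogisticRegression(max_iter=1000).fit(X, yb))
        return bit_classifiers

    def predict_output_code(X, bit_classifiers, codes):
        # Predict each bit, then pick the class with the closest codeword
        # (smallest Hamming distance).
        bits = np.column_stack([clf.predict(X) for clf in bit_classifiers])
        dists = np.abs(bits[:, None, :] - codes[None, :, :]).sum(axis=2)
        return dists.argmin(axis=1)

The data-assisted variant he describes differs in how `codes` is chosen: rather than assigning bits at random, the grouping of classes for each bit is picked by looking at the data so each binary problem is easier.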


>>: When you say look at the data, are you looking only at the first layer, at the results from the first layer [inaudible]?


>> Zhuowen Tu: Oh, okay.


>>: [inaudible]


>> Zhuowen Tu: No, no. So I should clarify. This one is a standard [inaudible] machine learning algorithm. It has nothing to do with the auto-context. You just consider this a multiclass classifier. I give you the features, and you just obtain a classifier.


>>: So you want to just do some first layering?


>> Zhuowen Tu: Yes.


>>: You're using just the features, just the patches?


>> Zhuowen Tu: Just -- yes. Yeah. Yeah. But --



>>: Because one could imagine doing the same thing but with higher [inaudible].


>> Zhuowen Tu: Yes. Yeah. Yeah.


>>: And have you tried adding more than log N bits but still many less than --



>> Zhuowen Tu: Yes.


>>: Because I would imagine it would get much better if you would --



>> Zhuowen Tu: Yeah. Yes. Yeah. That's what this figure is talking about. So you can see -- on the X axis is the number of bits. So up to this point you already can classify them. So if we add more, the first couple of bits will drastically improve the result, but afterwards it levels off. It depends. Also we tried almost ten UC Irvine machine learning datasets. So we wanted to get this thing clear without looking at the image data, because that's the good thing about machine learning: we work on some toy data, but then we get an understanding of how the behavior is.


It varies on different datasets. Sometimes for one particular dataset, if you push more, it's going to be better. Sometimes it pretty much averages out. But in general, adding error-correction capability always helps. Yeah. But you need to balance.


>>: [inaudible] doesn't that -- even though you add a [inaudible], is that [inaudible] get performance?


>> Zhuowen Tu: No. After a certain stage I think it's not going to improve too much.


>>: That's probably something about the intrinsic difficulty of the [inaudible].


>> Zhuowen Tu: Yeah. But, again, this is a separate thing. You can consider it just a traditional -- just a machine learning classification; it has nothing to do with images. But just --



>>: [inaudible]


>> Zhuowen Tu: Yeah. Yes. Yeah. Yeah. That's right. And we speed up the training
and testing process largely.


So then, talking about the scale. So now, because there are a very limited number of training images in there for each class, if we do only a patch-based approach -- a fixed patch size, a patch-based kind of approach -- then we throw them together. And let's say we consider it as a two-class problem, is it a dog or a sheep: you can see that for some dominant -- if the scale is really right, then these kinds of structures are far apart; if the scale is not right, then these patterns are confusing. This says something about the so-called margin for the classifier. So if you are at the right scale, then the margin is big; if you're at the wrong scale, the margin is small.


So then we want to design the algorithm to automatically look for the scale. So there are two options. One is to do so-called multiple instance learning. For each pixel you have multiple instances, and you consider them as a bag, where only the right one is at the correct scale. So it's a weakly supervised algorithm; you need to look for the specific class -- scale.


Then there's another thing here. I did the very simple -- we did the very simple thing. For each pixel, you just collect patches at different scales and throw them at a classifier. The classifier automatically trains. Then in testing, you also compute at different scales. So for the dominating scale it will tell; for the fuzzy scales, which are near the decision boundary, the decision is weak anyway. So doing it this way we handle the scale invariance problem better than before. So does this answer your question?
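A rough sketch of that simple multi-scale handling, with hypothetical helpers (patch_feats supplies whatever appearance features the base classifier uses, and the scale list is purely illustrative): train on patches cropped at several scales, and at test time average the responses over scales.

    import numpy as np

    SCALES = [11, 21, 31]  # example patch sizes (pixels); purely illustrative

    def multiscale_training_set(images, labels, patch_feats):
        """Collect, for every labeled pixel, one feature vector per scale."""
        X, y = [], []
        for img, lab in zip(images, labels):
            for (i, j), label in np.ndenumerate(lab):
                for s in SCALES:
                    X.append(patch_feats(img, i, j, s))
                    y.append(label)
        return np.array(X), np.array(y)

    def multiscale_probability(clf, img, i, j, patch_feats):
        # Average the classifier's response over all scales, so a confident
        # (dominant) scale contributes strongly while fuzzy scales, being near
        # the decision boundary, add little bias either way.
        probs = [clf.predict_proba([patch_feats(img, i, j, s)])[0, 1] for s in SCALES]
        return float(np.mean(probs))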


>>: Well, what do you get actually [inaudible]?


>> Zhuowen Tu: Yeah. So you only get a specific [inaudible], but once you train a classifier, then you have your classifier. You test it on different scales. The good scales are far from the margin; that means high confidence. The bad scales are fuzzy, so it's along, let's say, the probability --


>>: [inaudible]


>> Zhuowen Tu: Yeah, yeah.


>>: But then how do you -- what -- when you say classifier, you mean -- I thought you were trying to determine what [inaudible].


>> Zhuowen Tu: Yes. Yeah.


>>: So at the end, the classifier gives you a number?


>> Zhuowen Tu: A [inaudible] number, a confidence.


>>: But I thought of the scales given [inaudible] images [inaudible].


>> Zhuowen Tu: Yes, yeah. So that's why for each image, for each site you try -- for each pixel --



>>: For each image you [inaudible].


>> Zhuowen Tu: Yes, yes.


>>: [inaudible]


>> Zhuowen Tu: For each pixel.


>>: For each pixel.


>> Zhuowen Tu: Yes. Yeah. Not for just --



>>: [inaudible] choose the maximum response [inaudible]?


>> Zhuowen Tu: Yes. Yeah. Yeah. So there are two -- a couple of options we tried. First, you can compute which is the dominant scale, the scale which gives you the lowest entropy. The lowest [inaudible]. Or you can just vote them together, put them together and let them carry the uncertainty.


>>: [inaudible]


>> Zhuowen Tu: Yes, yeah.


>>: [inaudible]


>> Zhuowen Tu: Yes.


>>: [inaudible]


>> Zhuowen Tu: Yes.


>>: [inaudible]


>> Zhuowen Tu: Yes, yeah. So these are all the options and --



>>: [inaudible] different scales of [inaudible] each other if you do this? It seems to me like you're picking in this case an optimum scale, so that you're not necessarily using multiscale in the richest possible way, where a small feature and a large feature can work together to -- am I mistaken?


>> Zhuowen Tu: No. No. We do not explicitly do this, but all these small scales are implicitly carried over there, because in the end the best strategy actually is voting.


>>: Oh, okay. So you do a vote using a mixture of scales --



>> Zhuowen Tu: Yes.


>>: -- [inaudible] the next level up [inaudible].


>> Zhuowen Tu: That's right. Because we tried to obtain the smallest -- the patch with the scale with the smallest entropy. It helps, but not that significantly. So what we did is we just compute all the different responses across the scales and just average them together. So it does have different scales now -- because sometimes if you have a dominant one, this one will dominate. But you also get some help from other scales.


>>: So you're taking the average response of all the different scales.


>> Zhuowen Tu: Yes.


>>: So then -- okay. So I was going to ask you if each patch independently chose its scale, but if you take an average, it doesn't really matter, I guess, so...


>> Zhuowen Tu: That's right.


>>: [inaudible] average better than [inaudible]?


>> Zhuowen Tu: Yes. In many ways.


>>: [inaudible]


>> Zhuowen Tu: That's why recently [inaudible] learning has taken off; many of the alternatives [inaudible] learning really helps. Yeah.


So then we do another thing called region-based voting. [inaudible] So in the previous one, we just scanned each pixel and computed all these things. It's faster than the existing ones, but one minute is still something I'm not happy with. So what I -- we did was a simple kind of [inaudible] or mean-shift type of color-based segmentation. It's 0.1 seconds. Then you obtain some regions. So now you don't have to scan all the pixels; you just scan -- subsample, let's say, 5 percent of the pixels, again going through the scale space, and obtain the probability map.
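A rough sketch of that region-based voting step, with hypothetical inputs (a fast color oversegmentation would supply `regions`, and `prob_at_pixel` is the per-pixel classifier from before):

    import numpy as np

    def region_voting(prob_at_pixel, regions, sample_rate=0.05, rng=None):
        """regions: HxW array of region ids from a fast color oversegmentation.
        prob_at_pixel(i, j): foreground probability from the pixel classifier.
        Returns a per-region probability map with crisp region boundaries."""
        rng = rng or np.random.default_rng(0)
        out = np.zeros(regions.shape)
        for r in np.unique(regions):
            ii, jj = np.nonzero(regions == r)
            n = max(1, int(sample_rate * len(ii)))
            pick = rng.choice(len(ii), size=n, replace=False)
            # Average the sub-sampled pixel responses and assign that vote
            # to every pixel of the region, so boundaries follow the regions.
            vote = np.mean([prob_at_pixel(ii[k], jj[k]) for k in pick])
            out[regions == r] = vote
        return out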


And now it's much faster and also the boundaries are much more accurate. You can see this is the result by the previous auto-context: [inaudible] gets the whole body, but you can see it gets dilated. But here, with the segmentation, the boundaries are much more precise. So this is a recent work --



>>: [inaudible] cases where the color is --



>> Zhuowen Tu: Yes.


>>: -- segmented really well.


>> Zhuowen Tu: But that's one thing. For many of the real images, using color information, if you do a little bit of oversegmentation, the segmentation boundary is very precise; you just have oversegmentation. So the problem is how to do a grouping to combine them. So it really does help; you can see the cow segmentation [inaudible] than before.


>>: [inaudible]


>> Zhuowen Tu: Yes. Yeah. Yeah. To speed it up. So this is separate from [inaudible]. So in terms of significance, we have three pieces: the scale space, the data-assisted multiclass, and also this one. In terms of significance, it's less than the original auto-context work, but it helps improve the results and improve the speed.


So then later on this algorithm was directly applied to brain image segmentation. Now we have a 3D volume. Each voxel has an intensity value. And the task is to perform brain segmentation. What we want, and pretty much our lab's dream, is to build a brain Google, or a brain image analysis engine, like you're doing a blood test. When you do blood tests, they take a drop of blood and they do an analysis. Here you just go through the MRI scanner, obtain a 3D MRI image, and you have the different structures segmented. And then you can study their shapes, their different relationships, the cortical thickness, all these things. Then you will know what your age is based on the brain, what your -- all the diseases, whether it's HIV infection, Parkinson's disease, [inaudible] all these things.


So then for brain 3D segmentation, obtaining manual annotations is even harder, because we can only look at one particular slice. So it's hard for humans to obtain consistency, even for humans. Because, you see, if they manually segment a structure from this coronal view, then from the sagittal view it's bad, because you just make it smooth over here, but from the other angle it's not. So even for humans it's a very hard task.


But the good news for this particular domain is that the structures are more or less there. If you don't have a hippocampus, you're dead or an alien. So the problem is much better constrained. We used an almost identical procedure. The only difference is that instead of 2D features you have 3D features. That's it.


So this is the task of segmenting the caudate, and this is a grand challenge competition. They have brains from different machines -- Siemens, [inaudible] -- then they have [inaudible], then they ask you to train and segment.


And recently we did the task of segmenting the hippocampus. Actually, this curve shows our results are more consistent than human subjects in terms of the ground truth label. So in 3D we are even outperforming human subjects because of this constrained domain. In 2D we're hopeless -- it's far from being able to outperform humans. But in 3D, in this particular constrained domain, we're able to outperform human subjects.


>>: So how do you determine that [inaudible]?


>> Zhuowen Tu: That's a good question. We perform, like, a consistency measure, a P value, in terms of predicting the disease. That's the ultimate -- so it's no longer based on ground truth segmentation here; it's based on predicting the disease.


And also we recently did a study on monkeys, for what they call [inaudible] studies, a genetic study. So they have a whole zoo of monkeys somewhere in Los Angeles -- I don't know where they get them, and it's not the Los Angeles Zoo, but they have, like, a mountain there. They captured, I think, around 300 monkeys -- no, more than that, 800 monkeys. They hire a truck -- they rent a truck, they hire a person, they put the MRI scanner in the truck and they just scan the monkey brains.


So then they study the heritability or genetics, all these things. And our results are really much more significant, much better than the results by humans, because graduate students -- well, even some of you [inaudible] -- we cannot guarantee consistency. We get tired oftentimes.


So then we do a full brain image segmentation for 50 structures, and now we have this so-called pipeline up on the Web. Everybody is welcome to download it. And it's being widely used in the neuroimaging domain. So it's a real product, but it's not like a company product -- we're still research oriented. But it's being widely used by many neuroscientists, because they have a scan and they just want to segment the structures.


So this is just a little illustration of Belief Propagation in terms of MRF and CRF. Again, as I said, Belief Propagation needs to propagate a message very slowly to the point of interest. And the models are very limited, because we're forced to perform integration. And that's why the modeling capability of the existing [inaudible] MRF and CRF is very limited: we can only afford to use or have a very small number of connections. But here we can have a very large number of connections.


And then computing is what I emphasized. What you can learn is what you can compute. The testing shares an almost identical procedure with the computing; the only difference is the generalization error. But that's it. So you don't have to have a very heavy algorithm design. And the identical algorithm works on the horse, all these different things.


So the conclusion for the auto-context part is that we learn low-level and context modeling in an integrated framework. That's important, because we don't want to distinguish too much between all the high level and the low level, although this is still a black box type of thing, so the new thing is to push for something more explicit. It's very easy to implement. Significantly faster than sampling or [inaudible] algorithms that [inaudible] have to have an algorithm design, all these things. But it also has a definite disadvantage: it requires training for different problems. So for some of this [inaudible] work, you can define your energy and then you can use an off-the-shelf algorithm. So that's an advantage. But the disadvantage here is you need to train.


But for solving really difficult problems, it's probably hard to define some energy and apply an off-the-shelf algorithm. That's at least one doubt I have. We have to look into the structure of the problem. Training is slow, but now we have reduced it significantly. It requires labeled data, but it's not -- so we're pursuing weakly supervised and unsupervised approaches in this domain.


Since I have a little bit of time, I'll just spend two minutes -- this is something I really like, this year's work. I really like it. So we talked about segmentation and shape. Shape understanding has been a longstanding problem, and this problem bothers me a lot. Even when we have the shapes here, how do we tell this is a horse, this is a dog, this is a cat, this is a dog, and this is a mouse?


So sometimes local parts can tell you a lot. This is a typical horse kind of gesture, and this is a typical dog. We are less likely to see a horse like this. And also this kind of shape is important. So local parts tell us a lot of information.


But sometimes the global information is also important. Because, you see, this is a cat and this is a dog. Their local parts are very similar. But this bumpy type of thing pretty much says it's unlikely to be -- although cats sometimes do that, do this kind of thing.


So: local versus global. And also, if we have highly articulated objects or nonrigid objects, the definition is important [inaudible] and it's hard to capture with local parts. So these are two mice. They are viewed from completely different views. So the parts are different, and the contours, but the skeleton information is more or less similar. So if you compute the skeletal medial axis and compute the radius sequence on this big [inaudible], they are more or less the same. So there are these concepts: one is contour versus skeleton; another is local versus global.


So how do we really -- how do we really understand this particular shape information and put it together? I'm not going to get into the math, but the concept is very easy to understand. This is a recent dataset we collected for 20 objects, and it's probably the most difficult shape dataset ever. And this is tall grass; you can see the large variation: deer, crocodile, bird, butterfly. So even if [inaudible] gives you well-segmented objects, how do you tell? You see this is way beyond the usual understanding of elephant, and the monkey or the cat -- you see the large variation. How do we really do this?


This problem puzzled me a lot. So one thing we did is very simple. So given a shape -- let's say consider its contour -- we perform a simplification as a polygon. And you can extract these high-curvature points and then put them back. So now you have points. So for every pair of points you have a sequence. You throw them into the same space and normalize them. You obtain a signature. So this is one [inaudible] paper my student had. So the source code is there. You can repeat everything.


So it's very interesting to point out: given a shape, you compute the high-curvature points, these [inaudible] points. You throw all the pairs -- pair-wise types of features, these kinds of local parts -- together and normalize them. That's it. So that's about the contour. And you can also compute the skeleton. And you also obtain these kinds of -- the so-called end points. And for every pair of end points, you have a path from one point to the other. Then you place the discs and compute the radius sequence.


So this is very robust against junctions, because you don't -- if you have multiple junctions, which is typical in skeletons, you only compute the path. So it's very robust against junctions. And you obtain another set of features. So now you have two sets of features. You just throw them in -- throw them over there and put a nearest-neighbor classifier on top of it. So that's it. There is some kind of play with these two types of features; I'm not going into the details.
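A high-level sketch of the two descriptor families and the nearest-neighbor decision, leaving the geometric steps to the caller (the key points and the sequence extractor are inputs, since the contour simplification and skeleton computation are not spelled out here):

    import numpy as np
    from itertools import combinations

    def normalize(seq, length=50):
        # Resample a 1D sequence to a fixed length and scale to unit norm, so
        # segments of different sizes and scales become comparable.
        seq = np.asarray(seq, dtype=float)
        seq = np.interp(np.linspace(0, 1, length), np.linspace(0, 1, len(seq)), seq)
        return seq / (np.linalg.norm(seq) + 1e-8)

    def pairwise_descriptors(key_points, sequence_between):
        """key_points: high-curvature points (contour case) or skeleton end
        points (skeleton case). sequence_between(a, b): the 1D sequence along
        the piece joining a and b (contour samples, or disc radii on the path)."""
        return [normalize(sequence_between(a, b))
                for a, b in combinations(key_points, 2)]

    def nearest_neighbor_class(query_descs, train_descs, train_labels):
        # Each query descriptor votes for the class of its nearest training
        # descriptor; the shape takes the majority vote.
        votes = {}
        for q in query_descs:
            dists = [np.linalg.norm(q - d) for d in train_descs]
            lab = train_labels[int(np.argmin(dists))]
            votes[lab] = votes.get(lab, 0) + 1
        return max(votes, key=votes.get)

The actual combination rule for the two feature families (the "play" he mentions) is not detailed in the talk; this just shows the flavor of pooling pairwise contour segments and skeleton paths into one nearest-neighbor decision.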


But now you have two sets of features: one is from the skeleton, another is from the contour parts. And here the scale information is [inaudible], because you just take every pair, so you cross all the different scales -- every pair of points is there. And you use a nearest-neighbor -- so this is the result on all the data. So on the famous [inaudible] data, in which there are 70 -- 70 objects, each with 20 shapes -- we get a 96 percent classification rate. So that's evidence that particular dataset is not difficult.


So on our dataset, we compare with some of the existing work. We obtain 78.7, and many of the existing works are matching-based -- they're slow and the results are much worse. Let's say shape context or these matching-based algorithms.


And this is a matrix for different objects. CS means only using contour segments. SP is only the skeleton path, and CS and SP is the combination of them. So you can see, it's very interesting: the combination of them always gives a better result. And they are -- so for this mouse category, you can see the skeleton path tells a lot, because the contour is not that robust. But the skeleton -- because a mouse is highly nonrigid, or articulated, not rigid -- the skeleton path is very informative.

For some objects, let's say the bird, the contour is very informative -- the mouth of this bird [inaudible]. And also we don't have to explicitly worry about the scale, and there's no explicit matching. So we really get very good results.

And this is on real image data. So we have the horse and the cow; you just learn a classifier to obtain -- so this is the probability map. You do a binarization, then you obtain these masks. So then you run the algorithm, and it obtains a 96 -- so 96.5 accuracy -- recognition rate or classification rate on the horse and the cow.
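A minimal sketch of that pipeline from a real image to the shape decision, assuming the segmentation model and shape classifier from earlier (both passed in as functions, since the actual implementations are not given here):

    def classify_object(image, prob_map_fn, shape_classifier, threshold=0.5):
        """prob_map_fn: learned foreground/background model (e.g. auto-context),
        returning a per-pixel foreground probability map.
        shape_classifier: the contour/skeleton nearest-neighbor classifier."""
        prob_map = prob_map_fn(image)
        mask = prob_map > threshold   # simple binarization of the fuzzy map
        return shape_classifier(mask)  # e.g. "horse" vs. "cow"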


So I'm quite happy about this, because, as I said, the problem puzzles me a lot, but [inaudible] by a very simple play we were able to combine the contour together with the skeleton, and you don't have to explicitly worry about scale, and it avoids the explicit matching kind of thing. Even when we're given well-segmented structures it's hard, but I think, with this kind of work, we push this general shape classification to a stage where things can really -- can naturally come out.


>>: Sorry. A lot of the [inaudible] what you're doing with the real images now, because how do you do [inaudible] these real images?


>> Zhuowen Tu: So for real images now -- in the training I just -- we have the label map. Then you just train a classifier in terms of segmentation. Now it's a foreground and background segmentation. Then you obtain a probability map in terms of the object, but you don't know which one it is. So you -- the only thing --



>>: [inaudible] using the training data from that sense to --



>> Zhuowen Tu: Obtain segmentation.


>>: And you're doing just the outlines of those --



>> Zhuowen Tu: Yes.


>>: -- in your technique here?


>> Zhuowen Tu: Yeah.


>>: And then you -- and then you train up your image base using [inaudible]. What's the middle -- what's the middle --



>> Zhuowen Tu: So the middle one is the classification probability map. So once you obtain the probability map, you just do a simple binarization to obtain the segmentation.


So here it's a fuzzy map. Then you just make a hard decision to obtain segmentation.


>>: [inaudible]


>> Zhuowen Tu: Coming from here.


>>: The binary map.


>> Zhuowen Tu: Yes. From the binary --



>>: [inaudible]


>> Zhuowen Tu: Yes. Yeah. That's the performance.


[multiple people speaking at once]


>>: You train yours on this kind of image.


>> Zhuowen Tu: Yes, yeah.


>>: On this kind of contour,
not at [inaudible].


>> Zhuowen Tu: No, no, training is on the real contour.


>>: Training on a real contour.


>> Zhuowen Tu: It's a manually annotated contour. So this says even though you have really bad segmentation results, the algorithm is still robust, because you see the parts are missing. But because of all the other parts -- yeah, yes. That's right. Yeah.


>>: [inaudible]


>> Zhuowen Tu: Yes, yeah, yes.


>>: [inaudible] the skeleton will not give you much.


>> Zhuowen Tu: There is still -- for the skeleton you see 85.


>>: Yeah, I think about [inaudible].


>> Zhuowen Tu: For some, yeah. For some. Again, the overall message is that combining the skeleton with the contour really helps, and here we throw everything into the space as a nearest neighborhood. It's very simple, but it captures a lot of the essence of the shape problem.


>>: Really the message of this is just that it's kind of a testament to the robustness of the original algorithm to having pieces of it missing --


>> Zhuowen Tu: Yes, yeah, yeah. Occlusion all these things, yes.


>>: [inaudible]


>> Zhuowen Tu: Yes, yeah, yeah.


>>: I mean, in a way, this is a very rounded --


>> Zhuowen Tu: Yes.


>>: -- it's like a very roundabout way of doing what might have been a simpler -- I mean, I [inaudible] like to see, just with occlusions or with noise of other kinds like this, how it does. And I'd imagine it would do very well.


>> Zhuowen Tu: Yes, yeah. Because in all these -- these images, you already have a large occlusion, self-occlusion. Yeah, self. But the question is how do we -- because nowadays -- if this were five years ago, I wouldn't have to show the results on real images, but now people always ask how this whole thing works on real images. So that's why we want to show that it can handle errors in segmentation, and it's -- because so many times the parts already tell you a lot of the story.


But if you were to do a matching, it's really an almost impossible task to match all these parts together, because you have to handle all these scale problems, all these things. So it's kind of implicitly getting away from explicitly matching the parts and how you combine the contour, all these things. So I think the message is clear, yeah.


>>: But the results here, are they horse versus cow?


>> Zhuowen Tu: Yeah.


>>: Okay. The -- in the cow dataset, if I remember correctly, it's all -- they look very similar to the cows that you show here.


>> Zhuowen Tu: Yes.


>>: So the cow data is very similar.


>> Zhuowen Tu: Yes.


>>: It's all cows in the same background [inaudible]. So do you get a sense for the classifier as kind of recognizing the cow's head, and then it's kind of like the main --


>> Zhuowen Tu: Yes, yes.


>>: So that's one benefit of your classification, you can say I recognize the subregion,
which is the cow's head, and that's actually acting as my main yes/no gain.


>> Zhuowen Tu: Yes, yeah, yeah. So we have to look into the -- really the details of which one is just [inaudible] summation. Yeah. But that's a very good question because -- yeah. Sometimes just one -- this particular chunk already helps --


>>: [inaudible] a very good [inaudible].


>> Zhuowen Tu: Yes, yeah, yeah, yeah. Yeah. That's --


>>: [inaudible]


>> Zhuowen Tu: Yes, yeah, yeah, yeah. So sure. I think in a sense it's artful play with a very simple technique, without going into too much hassle with [inaudible]. So you can see even this chunk is completely round. But all the other parts can play the role. Yeah. Yeah.


>>: [inaudible]


>> Zhuowen Tu: Yeah. Yeah. That's a good question. So that was actually part of my proposal to ONR, which is [inaudible], so, yeah, to combine them together [inaudible].


>>: [inaudible]


>> Zhuowen Tu: Yeah, into a shape prior. Yes. Yeah.


[multiple people speaking at once]


>> Zhuowen Tu: Yeah. Okay. I think that pretty much concludes my talk. Yeah.
Thank you so much.


[applause]


>>: One of the things that -- I have [inaudible].


>> Gang Hua: We still have 15 minutes.


>>: Okay. So one just has to do with the kind of relationship between what you're doing and what happens in visual cortex. I mean, one of the things that's kind of cool about this is that it -- your immediate propagation of information between it being distant or [inaudible] invariant, very different scales in one step, seems to me a little bit like the kind of connections that happen, say, between V1 and V4 directly, [inaudible] these connections that go directly through intermediate layers. And it's the first time that I've seen a kind of cascaded model that has those long-range connections across layers like that.


But the other thing that's also happening in visual cortex that hasn't generally been modeled very well in these kinds of things is feedback. And so I'm wondering if you've thought about incorporating feedback more, in the sense of taking your later results and then improving earlier layers using -- not using the [inaudible].


>> Zhuowen Tu: Yes. Yeah. That's -- that's an outstanding question. So my previous research, which Gang actually knows quite well, is this so-called [inaudible] Markov Chain Monte Carlo: it activates the high-level part, but driven by the low-level part, which is the feed-forward part. But recently I'm moving away from that a little bit, at least at this moment, because if you do this, the learning and computing part is pretty heavy. But here you can see every step is feed-forward, or deterministic. So you can see you can push the result back to a certain degree. It's far from being perfect. Yeah.
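
(A minimal sketch, in Python, of a purely feed-forward cascade in the spirit described here, where each stage appends the previous stage's predicted probability to the per-pixel appearance features; the two-stage depth and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the actual learner from the talk.)

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_cascade(appearance_feats, labels, n_stages=2):
        # appearance_feats: (n_pixels, d) appearance features per pixel.
        # labels: (n_pixels,) ground-truth 0/1 labels.
        # Each stage sees the appearance features plus the previous stage's
        # probability, so information only flows forward; there is no feedback.
        stages = []
        context = np.zeros((appearance_feats.shape[0], 1))
        for _ in range(n_stages):
            X = np.hstack([appearance_feats, context])
            clf = LogisticRegression(max_iter=1000).fit(X, labels)
            context = clf.predict_proba(X)[:, 1:2]  # fed forward to next stage
            stages.append(clf)
        return stages

    def apply_cascade(stages, appearance_feats):
        # Run the trained stages in order; the output of each stage becomes
        # part of the input of the next.
        context = np.zeros((appearance_feats.shape[0], 1))
        for clf in stages:
            X = np.hstack([appearance_feats, context])
            context = clf.predict_proba(X)[:, 1:2]
        return context.ravel()  # final per-pixel foreground probability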


>>: It also makes your learning nonconvex [inaudible] --


>> Zhuowen Tu: Yes. Yes.


>>: -- when you see all those problems.


>> Zhuowen Tu: Exactly. So in the end we still need to bring in the other knowledge. So another direction I'm pushing is to combine the so-called implicit and explicit information.


So I'm not happy with the [inaudible] stage where many things are implicit; I want to be more explicit. So then there's so-called unsupervised, supervised, and discriminative [inaudible] -- all these things I want to put together. So that's more recent work I'm doing, to put more into a group rather than feed-forward. It's like traditionally I did this whole heavy computing of top-down/bottom-up. Now I'm trying to get away from it a little bit --


>>: [inaudible]


>> Zhuowen Tu: Yes. In one way, because much of the information, just like Google, we can decide really quickly. Then once these things are more set, then probably we need to bring them back. Although this feed-forward way is kind of -- there is information being propagated, because when you do the classification, some other parts already tell -- if this part is dominant, it already gives you the information, so there is indeed some kind of -- the feedback over there. But it's not like what you're describing.


>>: [inaudible] there's plenty of reaction [inaudible] that suggests that the feedback is
really important for learning but is usually not
important for discrimination, for the actual
detection. Otherwise you can prove that it would take too long.


>> Zhuowen Tu: Yes, yes. That's right.


>>: Okay. Well, my other question is I wonder if you've thought about applying any of these ideas to text.


>> Zhuowen Tu: That's great. Yes, yeah. That's something, yes, I want to do. So here it's an image; I really want to apply it to just language or text [inaudible] --


>>: [inaudible] one dimensional.


>> Zhuowen Tu: Yes, one dimension.


>>: [inaudible]


>> Zhuowen Tu: Yes. Yeah, yes. I actually tried it on one particular case that shows improvement, but I haven't got a chance to really work hard on this language side. Because I think it's natural compared to the existing CRF model, the existing [inaudible] model. I think --


>>: [inaudible] very special [inaudible] --


>> Zhuowen Tu: Yeah, yeah.


>>: -- [inaudible] to me and I would love to --


>> Zhuowen Tu: Yeah.


>>: -- to maybe [inaudible].


>> Zhuowen Tu: Sure, sure. Yeah. Yeah. Yes.


>>: In your brain segmentation, how many structures were you [inaudible]?


>> Zhuowen Tu: 56.


>>: And were you using an atlas to drive the segmentation, or --


>> Zhuowen Tu: So it's similar to this one. We have 48 brains manually annotated. You can consider that's the atlas. Then you train your algorithm, yeah.


>>: And each of them had 56 structures in it?


>> Zhuowen Tu: Yes.


>>: Okay. And what was the range of variation in the shapes of --


>> Zhuowen Tu: I think like 20 are controls, 20 are like -- there are 10 with Alzheimer's disease, 10 with something like schizophrenia. So yeah. But I don't really have a detailed description of kind of where, because it's very hard to compute the variance for the structures, because it's 3D.


>>: From where the -- I see. Okay. And these are -- do you know the MRI machine specification [inaudible]?


>> Zhuowen Tu: Mostly Siemens I think.


>>: [inaudible]


>> Zhuowen Tu: I don't --


>>: Okay.


>> Zhuowen Tu: I can look into the paper. I recently sent out the paper. But I --


>>: [inaudible] visual features [inaudible] visual features?


>> Zhuowen Tu: So here I pretty much used gradients and also some of the 3D Haars
and curvatures, mean curvature, Gaussian curvature, all these.



>>: His was not [inaudible].


>> Zhuowen Tu: [inaudible] you should use. But intensity sometimes is not robust. For one scan you can have intensity 100; in another one you can have 20,000.


>>: [inaudible]


>> Zhuowen Tu: Yeah. There is a lot. One thing you can do is to normalize the intensity. But it's hard to normalize them completely. So gradient information. Yeah.
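
(A minimal sketch, in Python, of the kind of per-scan intensity normalization and gradient features mentioned here, assuming the scan is a 3D NumPy array; the percentile-based rescaling and the use of np.gradient are illustrative choices, not necessarily what was used in the actual system.)

    import numpy as np

    def normalize_intensity(volume, low_pct=1, high_pct=99):
        # Rescale a raw MR volume so scans with very different intensity
        # ranges (say 0-100 vs. 0-20,000) become roughly comparable.
        lo, hi = np.percentile(volume, [low_pct, high_pct])
        return np.clip((volume - lo) / (hi - lo + 1e-8), 0.0, 1.0)

    def gradient_magnitude(volume):
        # Gradient magnitude is less sensitive to the absolute intensity
        # scale than raw intensities, which is why it is a safer feature.
        gz, gy, gx = np.gradient(volume.astype(np.float64))
        return np.sqrt(gx**2 + gy**2 + gz**2)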


>> Gang Hua: Thank you.


>>: That was great.


>>: One more question. What was the smallest structure that you segmented in the brain?


>> Zhuowen Tu: I think hippocampus probably is.


>>: And what was the accuracy? Do you remember?


>> Zhuowen Tu: Hippocampus is the worst. I think around 75. Yeah. Yeah. Hippocampus is the worst, especially the tail part. It's been the most difficult structure. Yeah. Okay. Thank you.


[applause]