Clustering appearance and shape by Jigsaw, and comparing it with Epitome.


Final project: Bae&Kim, Nov. 27, 2006


Clustering appearance and shape by Jigsaw, and comparing it with Epitome

Soo Hyun Bae, Woo Young Kim
Georgia Institute of Technology
CS 7495 Fall 2006
Due: November 28, 2006



1. Overview

This project involves four papers: two of them are the basis of our original implementation, and the other two describe the corresponding methods. Our original intention was to implement the advanced techniques presented in the paper "Epitomic analysis of appearance and shape" by N. Jojic, B. J. Frey, and A. Kannan, although an implementation of the epitome-learning portion had already been publicly released. While implementing applications of the epitomic model, including image segmentation and image denoising, we found that two recent papers propose improvements to the original epitomic analysis: (1) "Video epitomes," by V. Cheung, B. J. Frey, and N. Jojic, and (2) "Clustering appearance and shape by learning jigsaws," by A. Kannan, J. Winn, and C. Rother.

Paper (1) extends the original work to the spatio-temporal domain while retaining the generative model of the given source and its training methods; it also details how to exploit higher-dimensional sources (e.g., videos). Paper (2) suggests a flexible technique for generating a texture map, there called a Jigsaw, and provides more feasible results for images.

Accordingly, we decided to implement the "Jigsaw" algorithm of paper (2) and to compare the results with epitomic learning and reconstruction. To reach this goal, we had to employ the "Graph Cut" algorithm introduced in "Fast approximate energy minimization via graph cuts" by Y. Boykov, O. Veksler, and R. Zabih, since the Jigsaw is obtained through iterative EM steps: an offset map, which bridges a given image to a jigsaw map, is obtained by an optimization procedure based on a graph-cut algorithm.



2. Papers

[1] A. Kannan, J. Winn, and C. Rother. Clustering appearance and shape by learning jigsaws. To appear in Advances in Neural Information Processing Systems, Volume 19, 2006.

[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 23(11), 2001.

[3] V. Cheung, B. J. Frey, and N. Jojic. Video epitomes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2005.

[4] N. Jojic, B. J. Frey, and A. Kannan. Epitomic analysis of appearance and shape. In Proc. IEEE Int. Conf. Computer Vision (ICCV), 2003.


3. Algorithms

The epitome and Jigsaw models are patch-based probabilistic generative models. Such representations of a given image I are much more compact than the source representation, but retain most of the constitutive elements needed to reconstruct it. The difference between the two is as follows: while epitomic analysis and other patch-based models have to specify the appropriate set of patch dimensions and shapes before learning, the Jigsaw model can effectively learn the patch sizes and shapes during the learning process.


3.1. Epitome


We consider an epitome as a set of Gaussian distributions; that is, each entry has a mean and a variance as its parameters. To learn the parameters of the epitome, we train on a set of patches from a set of images I, and update the parameters by averaging patch values under an appropriate mapping T. The training patches z_k are systematically selected, for example, as all possible or partially overlapped patches in I.
Figure 1 shows the graphical model of the epitome. According to it, to generate each pixel x in image I, we consider the patches z_k that share x. Then

    x = (1/N) * sum_k z_k(x),

where N is the number of patches that share x and z_k(x) is the value patch z_k predicts at x. Each patch is generated by the epitome e, using the corresponding mapping T.


The entire model then has a joint distribution over the epitome, the mappings, the patches, and the image. Maximizing this joint distribution leads to update rules that are iterated: the first and second quantities are computed in the E-step, and the third and fourth are updated in the M-step. The value in (3) determines the epitomic mean and the value in (4) the epitomic variance; note that (1) is the estimated patch value assuming that the epitome and the mappings are known.


We must, however, decide the patch sizes and shapes before learning; in practice, the patch shape is chosen to be a rectangle.
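As an illustration, the accumulation underlying the M-step for rectangular patches can be sketched as below. This is a simplified hard-assignment version (each patch is committed to a single mapping), not the paper's soft-posterior update rules; all names here are our own:

```python
import numpy as np

def update_epitome(patches, mappings, epitome_shape):
    """Hard-assignment M-step sketch: each training patch z_k is placed in the
    epitome at the top-left location given by its mapping T_k, and each epitome
    entry's mean/variance are re-estimated from the patch values landing on it.
    (The paper averages under soft posteriors over mappings; this simplifies.)
    """
    h, w = epitome_shape
    total = np.zeros((h, w))
    total_sq = np.zeros((h, w))
    count = np.zeros((h, w))
    for z, (r, c) in zip(patches, mappings):
        ph, pw = z.shape
        total[r:r + ph, c:c + pw] += z
        total_sq[r:r + ph, c:c + pw] += z ** 2
        count[r:r + ph, c:c + pw] += 1
    count = np.maximum(count, 1)        # avoid division by zero for unused entries
    mean = total / count
    var = total_sq / count - mean ** 2  # E[z^2] - E[z]^2
    return mean, var

# Two constant 2x2 patches mapped to the same epitome location.
patches = [np.full((2, 2), 3.0), np.full((2, 2), 5.0)]
mean, var = update_epitome(patches, [(0, 0), (0, 0)], (4, 4))
print(mean[0, 0], var[0, 0])  # mean 4.0, variance 1.0
```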






3.2. Jigsaw

The Jigsaw is also a patch-based probabilistic generative model, and each entry is likewise a Gaussian distribution. Although it learns a jigsaw from a set of training patches of input images, as the epitome does, the patches are not selected manually; instead, it learns the appropriate sizes and shapes of the patches automatically. In other words, the main difference of the Jigsaw from the epitome is an associated offset map L of the same size as the input image I. Figure 2 shows the graphical model of the generative process of the Jigsaw.

Each pixel z in a jigsaw J is defined by an intensity mean μ(z) and a variance 1/λ(z) (λ is called the precision), analogous to the epitome mean and variance. The model joins together pieces of the jigsaw using a hidden offset map of the same size as the input, and then adds Gaussian noise with variance given by the jigsaw.

Note that the offset map L defines a position in the jigsaw for each pixel in the input. Going into detail on learning the jigsaw, the model makes the following assumptions: first, each pixel of the input image I is independent conditioned on the jigsaw J and the offset map L; second, to preserve local consistency in the input, so that neighboring pixels have the same offsets, it defines a Markov random field over the offsets.


Figure 1. Graphical model of the epitome: e is the epitome, T is the hidden mapping, and the z's are the patches predicted from e; the x's are measurements generated by adding noise to the patches.



The Markov random field over the offset map is defined on the set E of edges of the 4-connected grid. In order to allow unused pixels in the jigsaw during learning, the model places a Normal-Gamma prior on the jigsaw parameters.


The jigsaw is then learned by iteratively maximizing the joint probability. Just as the epitome alternates between the hidden mapping T and the parameters (the mean and variance of each Gaussian distribution), the Jigsaw alternates between the hidden offset map L and its parameters, the mean and precision.

Figure 2. Graphical model of the Jigsaw: J is the jigsaw, L is the hidden offset map, and X is an input.


The offset map L is updated by applying the alpha-expansion graph-cut algorithm of [2]. With this L, the model then updates the mean and precision of each jigsaw pixel z from X(z), the set of image pixels corresponding to z.
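Given the offset map, this update can be sketched as follows. It is a maximum-likelihood sketch that drops the Normal-Gamma prior terms, with names of our own choosing:

```python
import numpy as np

def update_jigsaw(image, offsets, jigsaw_shape):
    """For each jigsaw pixel z, collect X(z), the image pixels that the offset
    map sends to z, then set the mean to their average and the precision to the
    inverse of their variance. (Maximum-likelihood sketch: the Normal-Gamma
    prior terms of the paper are omitted for brevity.)
    """
    jh, jw = jigsaw_shape
    buckets = [[[] for _ in range(jw)] for _ in range(jh)]
    h, w = image.shape
    for r in range(h):
        for c in range(w):
            dr, dc = offsets[r, c]              # offset for this image pixel
            zr, zc = (r + dr) % jh, (c + dc) % jw
            buckets[zr][zc].append(image[r, c])
    mean = np.zeros((jh, jw))
    precision = np.ones((jh, jw))
    for zr in range(jh):
        for zc in range(jw):
            xs = buckets[zr][zc]
            if xs:
                mean[zr, zc] = np.mean(xs)
                v = np.var(xs)
                precision[zr, zc] = 1.0 / v if v > 0 else 1e6  # cap constant X(z)
    return mean, precision

# All four image pixels map to the single jigsaw entry.
img = np.array([[2.0, 4.0], [6.0, 8.0]])
mean, prec = update_jigsaw(img, np.zeros((2, 2, 2), dtype=int), (1, 1))
print(mean[0, 0], prec[0, 0])  # mean 5.0, precision 1/var = 0.2
```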


4. Implementation

We implemented the learning process of the Jigsaw model and reconstruction with it. As part of learning the Jigsaw model, we implemented the graph cut of paper [2] in order to obtain the offset map.

4.1. Jigsaw

There are two main parts. First, we initialize the jigsaw by setting the precisions to their expected value under the prior, b/a, where b is the shape parameter and a is the scale parameter of the Gamma component of the jigsaw's prior; the mean is set equal to that of the Gaussian fit to the input distribution. This initialization helps us avoid falling into a local maximum away from the globally optimal solution. Beginning with these initial jigsaw parameters, we obtain the offset map L by the graph-cut algorithm; this process is detailed in the next subsection. After the offset map L is obtained for the input data, we update the means and precisions using it, as explained in Section 3.2. We repeat this process until convergence.
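The overall loop can be illustrated on a toy one-dimensional "jigsaw", replacing the graph-cut step with a per-pixel data-cost argmin (so the smoothness term and the precision update are omitted). Everything here is our simplification for illustration, not the paper's algorithm:

```python
import numpy as np

def learn_jigsaw_toy(image, jigsaw_size, iters=10, b=2.0, a=1.0):
    """Toy version of the alternating scheme described above.

    Precisions start at the prior's expected value b/a and means at the input's
    overall mean (plus a tiny tie-breaking perturbation); we then alternate
    (i) choosing, for each image pixel, the best jigsaw index by data cost
    alone (a stand-in for the graph-cut step; smoothness is ignored) and
    (ii) re-estimating each jigsaw entry's mean from its assigned pixels.
    """
    rng = np.random.default_rng(0)
    k = jigsaw_size                              # flat 1-D "jigsaw" of k entries
    mean = np.full(k, image.mean()) + 0.01 * rng.standard_normal(k)
    prec = np.full(k, b / a)
    pixels = image.ravel()
    labels = np.zeros(pixels.size, dtype=int)
    for _ in range(iters):
        # E-like step: data cost = prec * (x - mean)^2 per (pixel, entry) pair
        cost = prec[None, :] * (pixels[:, None] - mean[None, :]) ** 2
        labels = cost.argmin(axis=1)
        # M-like step: update means from assigned pixels (precisions left fixed)
        for j in range(k):
            assigned = pixels[labels == j]
            if assigned.size:
                mean[j] = assigned.mean()
    return mean, labels

# Two intensity levels settle onto the two jigsaw entries.
means, labels = learn_jigsaw_toy(np.array([[0.0, 0.0, 10.0, 10.0]]), 2)
print(sorted(means))
```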

4.2. Graph-cut

The focus of paper [2] is the minimization of energy in early-vision problems. The goal is to find a labeling f that assigns each pixel p a label f_p in a label set L, where f is both piecewise smooth and consistent with the observed data. Many vision problems can be naturally formulated as the minimization of two energy terms:

    E(f) = E_smooth(f) + E_data(f)
Here, E_smooth(f) measures the extent to which f is not piecewise smooth, while E_data(f) measures the disagreement between f and the observed data. In generating the optimized offset map for a given jigsaw, the label set L corresponds to the set of indices into the jigsaw, so f assigns a jigsaw index to each pixel. The data energy is formulated as

    E_data(f) = sum_p D_p(f_p)

At each pixel location p, we calculate the data cost for a given offset. We chose the distortion function to be a squared (second-order) norm: the data cost D_p(f_p) is the squared difference between the pixel value and the jigsaw value mapped to p by the offset,

    D_p(f_p) = ( i(p) - f_p(p) )^2

The smoothing term considers, at each pair of neighboring locations, the difference between the jigsaw labels:

    E_smooth(f) = sum_{p,q} V_{p,q}(f_p, f_q)

where the sum runs over 4-connected neighbor pairs {p, q}.

We tested several distortion functions and found that a weighted first-order norm works well:

    V(f_p, f_q) = w_pq * | f_p - f_q |

where we set w_pq to 0.2 for every pair.
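Putting the two terms together, the energy of a candidate labeling can be evaluated directly. In this sketch the jigsaw lookup is abstracted into a precomputed per-pixel cost table (our own simplification):

```python
import numpy as np

def total_energy(data_cost, labels, w_pq=0.2):
    """E(f) = E_data(f) + E_smooth(f) for a labeling of a 2-D grid.

    data_cost[r, c, l] holds D_p(l), the distortion of giving pixel p = (r, c)
    label l; the smoothness term is w_pq * |f_p - f_q| summed over 4-connected
    neighbor pairs, with w_pq = 0.2 as in the text.
    """
    h, w = labels.shape
    e_data = sum(data_cost[r, c, labels[r, c]] for r in range(h) for c in range(w))
    e_smooth = w_pq * (np.abs(np.diff(labels, axis=0)).sum()
                       + np.abs(np.diff(labels, axis=1)).sum())
    return e_data + e_smooth

# 2x2 grid, 2 labels: label 0 costs 1.0 everywhere, label 1 costs 0.0.
dc = np.zeros((2, 2, 2))
dc[:, :, 0] = 1.0
labels = np.array([[0, 1], [1, 1]])
print(total_energy(dc, labels))  # 1.0 data + 0.2 * 2 smoothness
```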


In paper [1], the authors describe using the alpha-expansion graph-cut algorithm to obtain an optimized offset map for a given jigsaw map. However, from a number of experiments, we conjecture that the alpha-expansion algorithm does not work correctly here; in our experience, only the general expansion graph-cut algorithm works for this optimal offset-map generation.


4.3. Epitome

For the purpose of comparison, we implemented reconstruction of the original image with the help of the open-source code for training epitomes. We used the training implementation of "Video epitomes" instead of the original image-epitome code, believing that the former performs better.


Results

We focused on comparing the Jigsaw results with the epitome, since the Jigsaw appears superior in that it can automatically learn appropriate patch sizes and shapes. For clear comparison, even though there are applications other than reconstruction, we applied the Jigsaw model to reconstruction only, and then compared the results with the epitome.


[Result panels for three inputs. Each panel shows: Input; Jigsaw; Reconstruction; Offset map; Epitome; Reconstruction (average); Reconstruction (no avg.).]

As a result, we demonstrate the comparison on three images. We present the Jigsaw mean and the image reconstructed with the inferred offset map; the epitome mean and the reconstructed images are also shown. For the epitome we show two reconstructed images: one using only one patch per pixel, the other a number of patches per pixel. The former method is most analogous to the Jigsaw reconstruction, since the Jigsaw algorithm needs only one offset per pixel.

As we can see, even when averaging patches with the epitome, the Jigsaw does a much better job of reconstruction. Notice that the epitome image with one patch per pixel is very blocky and the other is blurry, whereas with the Jigsaw we obtained an almost perfect reconstruction.

Note, however, that the image of the Jigsaw mean still requires an additional clustering step if we want to obtain the jigsaw-piece images shown in the paper. Because of the ambiguous explanation in the paper, we have not implemented this clustering part yet.

Figure 3. Comparison of the epitome and the Jigsaw via reconstruction. For easier comparison, we used the same images as the Jigsaw paper for the second and third examples; the top image was selected as a more complicated image to compare.



5. Problems

There were, however, several problems that we encountered while implementing the papers.


5.1. Out of memory

When we implemented the graph-cut algorithm to obtain the appropriate offset map during jigsaw learning, we faced a critical problem of insufficient memory. Recall that we want to minimize the data cost

    E_data(f) = sum_p D_p(f_p)

For every pixel, a cost map over the entire set of jigsaw values must be generated. The "bird" image is 268x179 and the jigsaw size was set to 30x30, so the data cost map requires 268 x 179 x 30 x 30 x 16 ≈ 690 MB at 'double' precision, which is beyond the memory limit of 32-bit MATLAB. Even with the costs truncated to 32-bit integers, it still requires around 170 MB, which is still not affordable. Therefore, we implemented the graph-cut algorithm with MEX, a C interface for MATLAB functions. The current implementation only works on a 64-bit Linux system with more than 8 GB of system memory.
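The arithmetic behind these figures can be checked directly, using the per-entry byte counts quoted above:

```python
# Cost-map memory for the 268x179 "bird" image with a 30x30 jigsaw:
# one cost entry per (image pixel, jigsaw position) pair.
entries = 268 * 179 * 30 * 30
mb_double = entries * 16 / 1e6  # 16 bytes per entry, the figure used in the text
mb_int32 = entries * 4 / 1e6    # 4 bytes per entry after truncating to int32
print(round(mb_double), round(mb_int32))  # ~690 MB vs ~170 MB
```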


5.2. Clustering the output of the Jigsaw image

The third image shown in the previous table is the same as the image shown in paper [1], but our jigsaw does not match the corresponding jigsaw figure. In paper [1], the authors mention that they applied a clustering step to determine the jigsaw pieces, but give no further details. We did not implement the clustering algorithm this time around, but we note that it would help present the jigsaw pieces in a comprehensible representation.