Fusing cluster-centric feature similarities for face recognition in video sequences

John See$^{a,*}$, Mohammad Faizal Ahmad Fauzi$^{b}$, C. Eswaran$^{a}$

$^{a}$Faculty of Computing and Informatics, Multimedia University, Cyberjaya 63100, Malaysia
$^{b}$Faculty of Engineering, Multimedia University, Cyberjaya 63100, Malaysia

$^{*}$Corresponding author. Tel.: +60 3 83125478. Fax: +60 3 83125264. Email address: johnsee@mmu.edu.my (John See)
Abstract

The emergence of video has presented new challenges to the problem of face recognition. Most existing methods focus on the use of either representative exemplars or image sets to summarize videos; there is little work on how the two can be combined effectively to harness their individual strengths. In this paper, we investigate a new dual-feature approach to face recognition in video sequences that unifies feature similarities derived within local appearance-based clusters. Relevant similarity matching involving exemplar points and cluster subspaces is comprehensively modelled within a Bayesian maximum-a-posteriori (MAP) classification framework. An extensive performance evaluation of the proposed method on three face video datasets has demonstrated promising results.

Keywords: Video-based face recognition, Clusters, Exemplars, Feature fusion, Bayesian MAP classification
1. Introduction

The emergence of video has presented new challenges to the problem of face recognition, particularly in the abundance of appearance variations, complex head poses and other external conditions. The accessibility of imaging devices and the extensive array of applications in consumer products, biometrics and surveillance have rendered automatic face recognition an increasingly important task. Recent surveys on face recognition (Zhao et al., 2003; Li and Jain, 2011) and the well-known Face Recognition Grand Challenge (FRGC) (Phillips et al., 2005) are evidence of the significant attention devoted by the computer vision and pattern recognition community to this problem. The current literature on video-based face recognition can be loosely divided into still-face-to-video matching and video-to-video matching (Poh et al., 2010).
The notion of video-to-video matching typically exemplifies a recognition problem over image sets, which can take the form of temporally ordered or unordered sets, consisting of entire sequences or only selected images. Image sets derived from both training and test videos are usually represented by subspaces or manifold models learned from the original vectors. Classification is then accomplished by computing subspace distances or similarity measures among these learned subspaces. Parametric model-based approaches use the Kullback-Leibler Divergence (KLD) metric (Shakhnarovich et al., 2002; Arandjelovic et al., 2005) to evaluate similarities between estimated densities; given the difficulty of parameter estimation under limited data, these methods tend to fail when the training and test sets are not strongly correlated statistically. Methods based on principal angles have attracted long-standing interest due to their simplicity. The Mutual Subspace Method (MSM) (Yamaguchi et al., 1998), the seminal work on principal angles for recognition, uses the cosine of the smallest principal angle, or largest canonical correlation, as a similarity measure. Improved methods that use principal angles, such as the Constrained Mutual Subspace Method (CMSM) (Fukui and Yamaguchi, 2003) and Kernel Principal Angles (KPA) (Wolf and Shashua, 2003), have also demonstrated good results in face recognition. Later, discriminant methods that maximize the discriminatory power between within-class and between-class sets were proposed: Discriminative Canonical Correlations (DCC) (Kim et al., 2007) and, more recently, a graph embedding discriminant analysis based on Grassmannian manifolds (Harandi et al., 2011).
Less often mentioned are exemplar-based representations for face recognition in video, which have also attracted increasing attention in the literature. Exemplar-based representations deal with the large array of face images in training videos by selecting a significantly smaller number of appearance-specific representative face images, or exemplars, to summarize each subject class. In common configurations, clusters or patches are extracted from the data manifold and sample means are chosen as exemplars. Although naturally regarded as a still-image-to-video matching approach, it usually requires a fusion of matching scores between exemplars (training set) and multiple query frames (test set) in order to realize a full video-to-video matching. Numerous approaches to exemplar selection have been proposed, using radial basis function networks (Krüger and Zhou, 2002; Zhou et al., 2003), k-means clustering (Hadid and Pietikäinen, 2004; Fan et al., 2005) and Hierarchical Agglomerative Clustering (HAC) (Fan and Yeung, 2006; See and Ahmad Fauzi, 2011). Wang et al. (2008) introduced the Maximal Linear Patch (MLP), which partitions data on the manifold into local linear models. More recently, Spatio-Temporal Hierarchical Agglomerative Clustering (STHAC) (See and Eswaran, 2011) was proposed to leverage both spatial and temporal information for the selection of exemplars from ordered images. Relatively good results have been reported for exemplar-based methods, although this setup is essentially sensitive to the effectiveness of clustering appearance variations and the number of exemplars chosen.
Despite claims in the literature of the superiority of video-to-video methods, an analysis from a recent evaluation of video-to-video face verification (Poh et al., 2010) observes that the performance of still-image-to-video and video-to-video matching is not significantly different, and that success depends on a few crucial factors such as the choice of manifold representation, the distance/similarity metric, and the length of the video sequence.
The growing trend of these approaches motivates our central question: how can both exemplar and image set matching be fused to further improve recognition ability? Approaches that utilize both image sets and exemplars are few. Most notably, Wang et al. (2008) proposed a weighted distance measure involving both subspace and exemplar distances by means of local linear patches for recognition on the face manifold. While this fusion into a single distance measure is straightforward and efficient, the optimum setting for the weight parameter is difficult to determine, and the measure does not fully leverage interactions at the cluster level.
In this paper, we introduce a new fusion approach that incorporates both cluster subspace and exemplar point similarities derived within local appearance-based clusters for face recognition in video. In this dual-feature representation, the learned cluster subspaces characterize set variations at the cluster level, while the selected exemplars encode detailed appearance information at the point (image) level. The subjects in video are recognized using a Bayesian maximum-a-posteriori classifier, in which a joint probability function that captures the relevant dependencies is formulated using similarity metrics. The promising results obtained from an extensive evaluation on three face video databases (Honda/UCSD, CMU MoBo and ChokePoint), with further comparisons against other related methods, demonstrate the effectiveness of the proposed approach.

The rest of this paper is organized as follows. In Section 2, we briefly describe the problem setting and the representations used. Section 3 elaborates on the proposed approach in detail. Experimental results and further analysis are discussed in Section 4, and Section 5 concludes the paper.
2. Feature Representation

In this work, we adopt a setting similar to that used in exemplar-based approaches, where face images in a video sequence are first grouped into local appearance-based clusters. These local clusters offer rich variational features that can be exploited: intuitively, subspaces or manifolds describe coarse set-based variations, while selected points within a cluster reflect fine appearance variations. In both training and test sequences, each subject class is therefore represented in a dual-feature mode, as subspaces (entire clusters) and as points (exemplar images).
2.1. Problem Formulation

Generally, we define a sequence of face images extracted from a video of subject $c$ as an array of observed data in $\mathbb{R}^D$,

$$\mathbf{X}_c = \{\mathbf{x}_{c,1}, \mathbf{x}_{c,2}, \ldots, \mathbf{x}_{c,N_c}\}, \qquad (1)$$

where $N_c$ is the number of face images in the video, the subject label of the $C$-class problem is $c \in \{1, 2, \ldots, C\}$, and each video is assumed to contain faces of the same person.
For each training video,$^1$ $M$ clusters are extracted for each subject,

$$\mathbf{Z}_c = \{\mathbf{z}_{c,1}, \mathbf{z}_{c,2}, \ldots, \mathbf{z}_{c,M}\}, \qquad (2)$$

where

$$\mathbf{z}_{c,m} = \{\mathbf{x}_{c,1}, \mathbf{x}_{c,2}, \ldots, \mathbf{x}_{c,N_m}\} \qquad (3)$$

is the $m$-th cluster, a set of $N_m$ images. Subsequently, an exemplar image is selected from each cluster, resulting in an exemplar set for the $c$-th subject,

$$\mathbf{E}_c = \{\mathbf{e}_{c,1}, \mathbf{e}_{c,2}, \ldots, \mathbf{e}_{c,M}\}, \qquad (4)$$

where $\mathbf{e}_{c,m} \in \mathbf{z}_{c,m}$.
For each test video $\mathbf{X}'$, each face image $\mathbf{x}'_i$ (denoted similarly to (1)) is a point feature of the unobserved data. Clusters are derived by partitioning the test video sequence into temporally continuous segments of length $L$,

$$\mathbf{z}'_k = \{\mathbf{x}'_{k,1}, \mathbf{x}'_{k,2}, \ldots, \mathbf{x}'_{k,L}\}. \qquad (5)$$

With that, the test cluster set for the $c$-th class is denoted by

$$\mathbf{\Theta}_c = \{\boldsymbol{\theta}_{c,k,1}, \boldsymbol{\theta}_{c,k,2}, \ldots, \boldsymbol{\theta}_{c,k,N_c}\}, \qquad (6)$$

where the test cluster corresponding to the $i$-th frame is assigned the $k$-th cluster segment, $\boldsymbol{\theta}_{c,k,i} = \mathbf{z}'_k$, based on its associated image $\mathbf{x}'_i \in \mathbf{z}'_k$.

$^1$This assumes one training video per subject. If more than one training video is used, image frames from all same-class videos can be aggregated sequentially before extracting clusters.


2.2. Cluster and Exemplar Extraction

Given the large amount of face variation in each training video, the nonlinear dimensionality reduction method Locally Linear Embedding (LLE) (Roweis and Saul, 2000) is applied to learn a low-dimensional embedding from the original data space. LLE is well known for its ability to model the intrinsic structure of a nonlinear data manifold, yielding a meaningful embedding that better captures face variations such as pose, expression and illumination.

Next, for each training video, the projected faces in LLE feature space are partitioned into clusters using the recently proposed Spatio-Temporal Hierarchical Agglomerative Clustering (STHAC) algorithm (See and Eswaran, 2011), where the consideration of both spatial and temporal distances tends to produce better clusters. The global spatio-temporal fusion scheme of STHAC is used in this work.
For each cluster, the face image nearest to the cluster mean is selected as the exemplar. The final exemplar set consists of the extracted exemplars from all training videos, which can be succinctly summarized as

$$\mathbf{E} = \{\mathbf{e}_{c,m} \mid c = 1, \ldots, C; \; m = 1, \ldots, M\}. \qquad (7)$$
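To make this step concrete, a minimal sketch of the exemplar selection is given below (our illustration, not the authors' code); it assumes each cluster is already available as an array of feature vectors, e.g. LLE coordinates, with the clustering itself performed beforehand by STHAC.

```python
import numpy as np

def select_exemplars(clusters):
    """Pick, from each cluster, the face nearest to the cluster mean.

    clusters: list of (N_m, d) arrays of face feature vectors.
    Returns one exemplar row index per cluster.
    """
    exemplar_ids = []
    for z in clusters:
        mean = z.mean(axis=0)
        dists = np.linalg.norm(z - mean, axis=1)  # distance of each face to the mean
        exemplar_ids.append(int(np.argmin(dists)))
    return exemplar_ids

# Toy usage: three clusters of random "faces" in a 10-D feature space.
rng = np.random.default_rng(0)
clusters = [rng.normal(size=(n, 10)) for n in (25, 40, 30)]
print(select_exemplars(clusters))
```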
2.3. Representation

For all cluster sets, features can be described using a subspace representation spanned by the images in the cluster, from which distances or similarity measures between subspaces can be derived. In our approach, each cluster $\mathbf{z}_i$ of the $i$-th set is represented as a $d$-dimensional linear subspace by an orthonormal basis matrix $\mathbf{P}_i \in \mathbb{R}^{D \times d}$, obtained by the Singular Value Decomposition (SVD) of $\mathbf{z}_i \mathbf{z}_i^T = \mathbf{P}_i \tilde{\Lambda} \mathbf{P}_i^T$, where $\tilde{\Lambda}$ and $\mathbf{P}_i$ are the eigenvalue and eigenvector matrices of the $d$ largest eigenvalues, respectively. Let $\mathbf{P}_i$ and $\mathbf{P}_j$ denote the orthonormal bases of two subspaces $\mathbf{z}_i$ and $\mathbf{z}_j$; the SVD of $\mathbf{P}_i^T \mathbf{P}_j \in \mathbb{R}^{d \times d}$ is given as

$$\mathbf{P}_i^T \mathbf{P}_j = \mathbf{Q}_{ij} \Lambda \mathbf{Q}_{ji}^T \quad \mathrm{s.t.} \quad \Lambda = \mathrm{diag}(\sigma_1, \ldots, \sigma_d), \qquad (8)$$

where the singular values $\sigma_1, \ldots, \sigma_d$ are the cosines of the principal angles, also commonly known as canonical correlations. This forms the basic working of the Mutual Subspace Method (MSM), where the cosine of the smallest principal angle (i.e., the largest canonical correlation) is used as the similarity measure between two subspaces. Kernelized (Wolf and Shashua, 2003) and discriminant (Kim et al., 2007) variants are also used in the comparative experiments presented in Section 4.1.
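The canonical correlations of Eq. (8) can be computed directly from two cluster data matrices; the following sketch, with toy data standing in for real face clusters, shows one way to do so under these definitions.

```python
import numpy as np

def canonical_correlations(A, B, d=10):
    """Cosines of the principal angles between the subspaces spanned by A and B.

    A, B: (D, n) data matrices whose columns are the face vectors of two clusters.
    d: subspace dimension; returns d singular values sigma_1 >= ... >= sigma_d.
    """
    # Orthonormal bases of the d-dimensional subspaces (thin SVD of the data).
    Pi = np.linalg.svd(A, full_matrices=False)[0][:, :d]
    Pj = np.linalg.svd(B, full_matrices=False)[0][:, :d]
    # The singular values of Pi^T Pj are the canonical correlations of Eq. (8).
    return np.linalg.svd(Pi.T @ Pj, compute_uv=False)

rng = np.random.default_rng(1)
A, B = rng.normal(size=(1024, 30)), rng.normal(size=(1024, 30))
sigma = canonical_correlations(A, B)
print(sigma.max())  # MSM-style similarity: the largest canonical correlation
```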
For the training exemplar set, we employ the recently proposed nonlinear dimensionality reduction method called Neighborhood Discriminative Manifold Projection (NDMP) (See and Ahmad Fauzi, 2011). In brief, NDMP seeks to learn an optimal low-dimensional projection by considering both intra-class and inter-class reconstruction weights. Global structural and local neighborhood constraints are imposed within a constrained optimization problem, which can be solved as the generalized eigenvalue problem

$$(\mathbf{E} \mathbf{M}_{\mathrm{intra}} \mathbf{E}^T) \mathbf{A} = \lambda (\mathbf{E} \mathbf{M}_{\mathrm{inter}} \mathbf{E}^T + \mathbf{E}\mathbf{E}^T) \mathbf{A}, \qquad (9)$$

where $\mathbf{E} \in \mathbb{R}^D$ is the training exemplar set, and $\mathbf{M}_{\mathrm{intra}}$ and $\mathbf{M}_{\mathrm{inter}}$ are the intra-class and inter-class orthogonal weight matrices, respectively. A new test sample $\mathbf{X}'$ can be projected into the embedded space $\mathbb{R}^{\tilde{d}}$ by the linear transformation $\mathbf{Y}' = \mathbf{A}^T \mathbf{X}'$, where $\tilde{d} \ll D$. The detailed theoretical formulation of NDMP can be found in the work by See and Ahmad Fauzi (2011).

[Figure 1: Cluster-centric extraction of subspace and point features from a training video sequence. The cluster sets are characterized by linear subspaces, while the exemplar set consists of exemplar points projected to NDMP feature space.]

Fig. 1 summarizes the extraction of subspace and point features from a training video sequence through local class-specific appearance-based clusters.
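Eq. (9) is a standard generalized eigenvalue problem. The sketch below shows how such a problem might be solved numerically; it is not the authors' NDMP implementation: the weight matrices are random stand-ins, the ridge term is added purely for numerical stability, and retaining the smallest eigenvalues is our assumption about the objective.

```python
import numpy as np
from scipy.linalg import eigh

def projection_from_eq9(E, M_intra, M_inter, d_out, ridge=1e-6):
    """Solve (E M_intra E^T) A = lambda (E M_inter E^T + E E^T) A for A.

    E: (D, n) exemplar matrix (columns are exemplars); M_intra, M_inter: (n, n)
    symmetric weight matrices. Returns a (D, d_out) projection matrix.
    """
    lhs = E @ M_intra @ E.T              # left-hand operator of Eq. (9)
    rhs = E @ M_inter @ E.T + E @ E.T    # right-hand operator
    rhs += ridge * np.eye(rhs.shape[0])  # regularize: rhs must be positive definite
    w, V = eigh(lhs, rhs)                # generalized eigenpairs, ascending order
    return V[:, :d_out]                  # assumption: keep the smallest eigenvalues

rng = np.random.default_rng(2)
E = rng.normal(size=(50, 20))
S = rng.normal(size=(20, 20))
A = projection_from_eq9(E, S @ S.T, np.eye(20), d_out=5)
print(A.shape)  # project a new sample x via y = A.T @ x
```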
3. Fusion approach to recognition

3.1. Dual-feature Bayesian classifier

The proposed recognition approach is formulated based on the evaluation of the class hypothesis set $H = \{H_c\}$, $c = 1, \ldots, C$, by Bayesian classification, where the subject identity of a new sequence $\mathbf{X}$ can be found by the maximum-a-posteriori (MAP) decision rule$^2$:

$$c^{*} = \arg\max_{c} \; p(c \mid \mathbf{X}). \qquad (10)$$

$^2$For clarity, we abuse notation by simply denoting the test sequence as $\mathbf{X}$, in contrast to the earlier notation $\mathbf{X}'$ of Section 2.3.
In general, assuming conditional independence between all observations, i.e. $\mathbf{x}_i \perp\!\!\!\perp \mathbf{x}_j \mid c$ for $i \neq j$, the posterior distribution over the class hypotheses at time frame $N$, the end of the test sequence $\mathbf{X}$, can be expressed by Bayes' rule,

$$p(c \mid \mathbf{X}) \equiv p(c \mid \mathbf{x}_1, \ldots, \mathbf{x}_N) \propto p(c) \prod_{i=1}^{N} p(\mathbf{x}_i \mid c). \qquad (11)$$

This is the commonly known Naive Bayes classifier.
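In practice, the product over frames in Eq. (11) is usually accumulated as a sum of log-likelihoods to avoid numerical underflow; a minimal sketch of this decision rule, with toy log-likelihood values, follows.

```python
import numpy as np

def map_decision(frame_log_liks, log_prior):
    """Naive-Bayes MAP of Eqs. (10)-(11), accumulated in the log domain.

    frame_log_liks: (N, C) array of log p(x_i | c) for each frame and class.
    log_prior: (C,) array of log p(c). Returns the winning class index.
    """
    log_post = log_prior + frame_log_liks.sum(axis=0)  # sum of logs = product
    return int(np.argmax(log_post))                    # arg max_c p(c | X)

rng = np.random.default_rng(3)
toy_ll = rng.normal(size=(150, 20))                    # 150 frames, 20 classes
print(map_decision(toy_ll, np.full(20, -np.log(20))))  # uniform class prior
```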
To further encode both subspace and point features, a Bayesian classifier can be modelled in the form of joint probability functions. In our earlier exemplar-driven approach (See et al., 2011), the joint probability function involved only the exemplar points and the causal relationship between exemplars and their respective classes,

$$p(c, \mathbf{E}, \mathbf{X}) = p(\mathbf{X} \mid c, \mathbf{E}) \, p(\mathbf{E} \mid c) \, p(c). \qquad (12)$$
From that, a new joint probability function is now proposed to fuse together the relevant similarities within local clusters at both point and subspace levels,

$$p(c, \mathbf{E}, \mathbf{Z}, \mathbf{\Theta}, \mathbf{X}) = p(\mathbf{X} \mid \mathbf{E}, \mathbf{\Theta}) \, p(\mathbf{\Theta} \mid \mathbf{Z}) \, p(\mathbf{Z} \mid c) \, p(\mathbf{E} \mid c) \, p(c). \qquad (13)$$
The conditional dependencies among the point feature ($\mathbf{E}, \mathbf{X}$) and subspace feature ($\mathbf{Z}, \mathbf{\Theta}$) variables are encoded in this Bayesian framework. Graphical models depicting both the exemplar-driven approach and the new approach are shown in Fig. 2.
To determine the class hypothesis at the end of a test video sequence, the posterior probability to be maximized is redefined as

$$p(c \mid \mathbf{E}, \mathbf{Z}, \mathbf{\Theta}, \mathbf{X}) = \frac{p(\mathbf{X} \mid \mathbf{E}, \mathbf{\Theta}) \, p(\mathbf{\Theta} \mid \mathbf{Z}) \, p(\mathbf{Z} \mid c) \, p(\mathbf{E} \mid c) \, p(c)}{\sum_{c'} p(\mathbf{X} \mid \mathbf{E}, \mathbf{\Theta}) \, p(\mathbf{\Theta} \mid \mathbf{Z}) \, p(\mathbf{Z} \mid c') \, p(\mathbf{E} \mid c') \, p(c')}$$
$$\propto p(c) \prod_{i=1}^{N} p(\mathbf{x}_i \mid \mathbf{E}, \mathbf{\Theta}) \, p(\mathbf{\Theta} \mid \mathbf{Z}) \, p(\mathbf{Z} \mid c) \, p(\mathbf{E} \mid c)$$
$$\propto p(c) \prod_{i=1}^{N} p(\mathbf{x}_i \mid \mathbf{E}) \, p(\mathbf{x}_i \mid \mathbf{\Theta}) \, p(\mathbf{\Theta} \mid \mathbf{Z}) \, p(\mathbf{Z} \mid c) \, p(\mathbf{E} \mid c), \qquad (14)$$

by the assumption of conditional independence between observations in $\mathbf{X}$ and the law of total probability. The random variables $\mathbf{\Theta}$ and $\mathbf{E}$ are also assumed independent of each other, as they represent different feature types and have no mutual causality for classification.
Two additional assumptions are asserted to further simplify this to the final compact form. With the use of local clusters, each class is now represented by $M$ clusters and exemplars (see Eqs. (2) and (4)). Hence, conditional probabilities involving $\mathbf{\Theta}$ and $\mathbf{E}$ can be aggregated by the sum rule for consistency in our framework. Finally, $p(\mathbf{x}_i \mid \mathbf{\Theta})$ and $p(\mathbf{Z} \mid c)$ are assumed to be non-informative terms and are treated as constants, as they are intuitively insignificant to the decision rule. With that, we obtain:
$$p(c \mid \mathbf{E}, \mathbf{Z}, \mathbf{\Theta}, \mathbf{X}) \equiv p(c \mid \mathbf{e}_{c,j}, \mathbf{z}_{c,j}, \boldsymbol{\theta}_i, \mathbf{x}_i)$$
$$\propto p(c) \prod_{i=1}^{N} p(\mathbf{x}_i \mid \boldsymbol{\theta}_i) \sum_{j=1}^{M} p(\boldsymbol{\theta}_i \mid \mathbf{z}_{c,j}) \, p(\mathbf{x}_i \mid \mathbf{e}_{c,j}) \, p(\mathbf{e}_{c,j} \mid c) \, p(\mathbf{z}_{c,j} \mid c)$$
$$\propto p(c) \prod_{i=1}^{N} \sum_{j=1}^{M} p(\boldsymbol{\theta}_i \mid \mathbf{z}_{c,j}) \, p(\mathbf{x}_i \mid \mathbf{e}_{c,j}) \, p(\mathbf{e}_{c,j} \mid c). \qquad (15)$$
Owing to the limited sample size in our problem setting, good estimation of the underlying distributions can be challenging and easily results in over-fitting or under-fitting of the data. In this work, we instead exploit relevant similarity metrics involving the point and subspace features derived from within the local clusters (see Fig. 3 for an illustration). Concretely, these similarity metrics are computationally inexpensive and structurally intuitive for formulating the relevant probabilities. The likelihoods $p(\boldsymbol{\theta}_i \mid \mathbf{z}_{c,j})$ and $p(\mathbf{x}_i \mid \mathbf{e}_{c,j})$ reflect the probabilities of the test subspace $\boldsymbol{\theta}$ and test point $\mathbf{x}$ at frame $i$ given their respective trained image clusters and exemplars. The conditional probability $p(\mathbf{e}_{c,j} \mid c)$ weighs the importance of each exemplar with respect to its parent class. Since we have no information on the correctness of any hypothesis at the start of the observation sequence, the class priors $p(c)$ are uniformly distributed over all class hypotheses.
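To make the structure of Eq. (15) concrete, the following sketch evaluates the fused posterior for one test sequence. It assumes the likelihood and prominence terms of Sections 3.2-3.4 have already been evaluated into arrays (random toy values are used here), and the log-domain accumulation is our implementation choice.

```python
import numpy as np

def dfb_score(cl_lik, ex_lik, prominence, log_prior):
    """Fused MAP score of Eq. (15) for one test sequence.

    cl_lik:     (N, C, M) cluster likelihoods  p(theta_i | z_{c,j})
    ex_lik:     (N, C, M) exemplar likelihoods p(x_i | e_{c,j})
    prominence: (C, M)    prominence weights   p(e_{c,j} | c)
    log_prior:  (C,)      log class priors     log p(c)
    """
    # Inner sum over the M cluster/exemplar pairs of each class ...
    per_frame = np.einsum('ncm,ncm,cm->nc', cl_lik, ex_lik, prominence)
    # ... then the product over the N frames, taken as a sum of logs.
    log_post = log_prior + np.log(per_frame + 1e-300).sum(axis=0)
    return int(np.argmax(log_post))

rng = np.random.default_rng(4)
N, C, M = 60, 20, 7
print(dfb_score(rng.uniform(0.75, 1.0, (N, C, M)),  # Eq. (16) values lie in [alpha, 1]
                rng.uniform(size=(N, C, M)),
                rng.dirichlet(np.ones(M), size=C),
                np.full(C, -np.log(C))))
```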
3.2. Cluster likelihood $p(\boldsymbol{\theta}_i \mid \mathbf{z}_{c,j})$

Subspace features extracted from both the training and test cluster samples can be used to determine the similarity between clusters in the form of principal angles or canonical correlations (as briefly discussed in Section 2.3).

[Figure 2: Graphical models for (a) the exemplar-driven, and (b) the dual-feature Bayesian classifiers.]

The cluster likelihood, i.e. the likelihood of the observed cluster subspace $\boldsymbol{\theta}_i$ (corresponding to face image $\mathbf{x}_i$) given the training cluster subspace $\mathbf{z}$, is as follows:

$$p(\boldsymbol{\theta}_i \mid \mathbf{z}_{c,j}) = (1 - \alpha) \, S^{CL}_i(\boldsymbol{\theta}_i, \mathbf{z}_{c,j}) + \alpha, \qquad (16)$$

where the normalized subspace (cluster) similarity metric is defined as the average of the first $r$ canonical correlations, $S^{CL}_i(\boldsymbol{\theta}_i, \mathbf{z}_{c,j}) = \sum_{l=1}^{r} \sigma_l / r$ with $r < d$. In our experiments, the value of $r$ is fixed to obtain consistent correlation values. The parameter $\alpha$ is the lower bound of the similarity metric (asserting the value range $[\alpha, 1.0]$) and defines its degree of sensitivity.
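A direct transcription of Eq. (16), with toy canonical correlations standing in for the output of an actual subspace comparison:

```python
import numpy as np

def cluster_likelihood(sigma, r, alpha=0.75):
    """Eq. (16): affine map of the subspace similarity onto [alpha, 1.0].

    sigma: canonical correlations of theta_i vs. z_{c,j}, in descending order.
    r: number of leading correlations to average (r < d).
    """
    s_cl = np.mean(sigma[:r])            # normalized subspace similarity S_CL
    return (1.0 - alpha) * s_cl + alpha  # lower-bounded likelihood

sigma = np.array([0.98, 0.91, 0.84, 0.40, 0.12])  # toy correlations
print(cluster_likelihood(sigma, r=3))             # -> 0.9775
```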
3.3. Exemplar likelihood $p(\mathbf{x}_i \mid \mathbf{e}_{c,j})$

A point similarity metric is defined as the inverse squared Mahalanobis distance between the observed face image $\mathbf{x}_i$ and the $j$-th exemplar of class $c$ in the NDMP-projected embedded space, $S^{EL}_i(\mathbf{x}_i, \mathbf{e}_{c,j}) = 1 / \left( (\mathbf{x}_i - \mathbf{e}_{c,j}) \Sigma^{-1} (\mathbf{x}_i - \mathbf{e}_{c,j})^T \right)$, where $\Sigma$ is the common covariance matrix of all class samples (inclusive of the test sample). The similarities are sum-normalized across all classes. The choice of the Mahalanobis metric here assumes well-behaved multinormal sample distributions with a common covariance matrix.

[Figure 3: Graphical illustration of the similarity metrics involving exemplar points (circles) and cluster subspaces (colored regions) within local class-specific clusters.]

The exemplar likelihood, i.e. the likelihood of the observed face image $\mathbf{x}_i$ given the training exemplars $\mathbf{e}$, is formulated under a stochastic selection rule as

$$p(\mathbf{x}_i \mid \mathbf{e}_{c,j}) = \frac{S^{EL}_i(\mathbf{x}_i, \mathbf{e}_{c,j})}{\sum_{k=1}^{C} \sum_{j'=1}^{M} S^{EL}_i(\mathbf{x}_i, \mathbf{e}_{k,j'})}. \qquad (17)$$
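A sketch of Eq. (17) follows; the common covariance matrix is supplied as an argument, and the small constant added to the squared distance is ours, guarding against division by zero.

```python
import numpy as np

def exemplar_likelihoods(x, exemplars, cov):
    """Eq. (17): inverse squared Mahalanobis similarities, sum-normalized.

    x: (d,) test face in NDMP space; exemplars: (C, M, d) training exemplars;
    cov: (d, d) common covariance. Returns the (C, M) array of p(x_i | e_{c,j}).
    """
    prec = np.linalg.inv(cov)                           # shared precision matrix
    diff = exemplars - x                                # broadcast over (C, M, d)
    d2 = np.einsum('cmd,de,cme->cm', diff, prec, diff)  # squared Mahalanobis dists
    sim = 1.0 / (d2 + 1e-12)                            # inverse-distance similarity
    return sim / sim.sum()                              # normalize over all classes

rng = np.random.default_rng(5)
p = exemplar_likelihoods(rng.normal(size=8), rng.normal(size=(20, 6, 8)), np.eye(8))
print(p.shape, p.sum())  # (20, 6), sums to 1
```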
3.4. Exemplar prominence $p(\mathbf{e}_{c,j} \mid c)$

The causal relationship between exemplars and their respective parent classes can be represented by the conditional probability $p(\mathbf{e}_{c,j} \mid c)$. Intuitively, these conditional probabilities act as exemplar prominence weights for the exemplar likelihoods $p(\mathbf{x}_i \mid \mathbf{e}_{c,j})$, or coefficients of the influence of each exemplar within its own class subspace. A point-to-subspace similarity metric is defined by the inverse $\ell_2$-Hausdorff distance from each exemplar $\mathbf{e}_{c,j}$ to its corresponding class-specific exemplar subspace $\mathbf{E}_c$ in NDMP space, $S^{PR}_{c,j}(\mathbf{e}_{c,j}, \mathbf{E}_c) = 1 / (\min_{\mathbf{e}' \in \mathbf{E}_c} \| \mathbf{e}_{c,j} - \mathbf{e}' \|)$.

The $M$ similarities of each $c$-th class are normalized by min-max normalization, a linear mapping to $[0, 1]$. Hence, the exemplar prominence probability is formulated as

$$p(\mathbf{e}_{c,j} \mid c) = \frac{S^{PR}_{c,j}(\mathbf{e}_{c,j}, \mathbf{E}_c)}{\sum_{j=1}^{M} S^{PR}_{c,j}(\mathbf{e}_{c,j}, \mathbf{E}_c)}. \qquad (18)$$

This term can be pre-computed during training since it does not depend on the observation sample $\mathbf{X}$.
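A sketch of this prominence computation is given below. One detail is our assumption: the Hausdorff minimum is taken over the other exemplars of the class, since each exemplar's distance to its own point would otherwise collapse to zero.

```python
import numpy as np

def exemplar_prominence(E_c):
    """Eq. (18): prominence weights p(e_{c,j} | c) for the M exemplars of a class.

    E_c: (M, d) exemplars of class c in NDMP space. Assumption: the Hausdorff
    minimum excludes each exemplar's zero distance to itself.
    """
    dists = np.linalg.norm(E_c[:, None, :] - E_c[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                  # exclude self-distances
    s = 1.0 / dists.min(axis=1)                      # inverse l2-Hausdorff similarity
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)  # min-max mapping to [0, 1]
    return s / s.sum()                               # Eq. (18): sum-normalize over j

rng = np.random.default_rng(6)
print(exemplar_prominence(rng.normal(size=(7, 19))))  # M = 7 weights, sum to 1
```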
4. Experiments

The proposed method is evaluated in a video-based recognition framework on three public face video datasets. For each video sequence, faces are extracted using the Viola and Jones (2001) cascaded face detector and resampled to 32 × 32 pixel grayscale images. To cope with the demanding head poses and rotations in the datasets, undetected faces are re-tracked and localized using the robust IVT tracker (Ross et al., 2008), ensuring that the experiments are thoroughly evaluated with all possible views considered.$^3$ The face images are histogram equalized, and no other pre-processing is applied prior to recognition.

Following accepted practice in most works, one video sequence per subject class is used for training and the remaining sequences are used for testing. We formulate an extensive evaluation procedure by creating an augmented test set that introduces a variety of sequences with different starting frame positions and sequence lengths. The augmented test set is constructed by randomly sampling $W$ subsequences at each of $T$ different lengths from the original test videos belonging to each subject class, for all $C$ subject classes. Thus, a total of $W \times T \times C$ subsequences were sampled for our experiments. This evaluation procedure is designed to reduce video length bias and to better mimic realistic scenarios with arbitrary sets of views.

$^3$The implementations of some works (Wang et al., 2008) directly use only the detected faces, ignoring undetected faces as outliers and excluding them from the final image set.
The first dataset used is CMU MoBo (Gross and Shi, 2001), one of the earliest datasets used for video-based face recognition, adapted from its original use for human identification at a distance. It consists of 96 sequences of 24 different people walking on a treadmill (4 videos per person); each video contains about 300 frames. The second dataset, Honda/UCSD (Lee et al., 2005), was collected specifically for face recognition in video and is one of the more challenging public datasets. We consider its first subset, which has 59 sequences of 20 different people (each person has at least 2 videos). Each video contains about 300-600 frames, comprising large pose and expression variations with significant out-of-plane head rotations. The third dataset is the recently collected ChokePoint dataset (Wong et al., 2011) by NICTA, for person identification under real-world surveillance conditions. We consider its first subset (Portal 1), which has 600 sequences of 25 different people. Each person has 24 different sequences, comprising a combination of 4 sequence shots, 3 camera angles, and 2 movement modes (entering and leaving the portal). The cropped face images provided with the dataset contain variations in illumination conditions and pose, as well as video quality and misalignments due to slight occlusions. To simplify our experimental setup for this dataset, we only sample sequences taken by Camera 1 (a single camera) in the leaving-portal mode.
To generate the augmented test set, we use the following sampling parameters $(W, T, C)$: CMU MoBo (12, 5, 20), Honda/UCSD (10, 5, 20) and ChokePoint (9, 5, 25).$^4$ The number of subsequence lengths is fixed at $T = 5$ to provide a fair and consistent evaluation across all tested datasets. The subsequence lengths are determined by $T$ equally distributed intervals of the original sequence length, while $W$ is decided on the basis of creating sufficient samples for each tested class.

$^4$Details of the protocol for generating the augmented test set can be found at http://pesona.mmu.edu.my/~johnsee/research/vfr
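The footnoted URL gives the exact protocol; the sketch below is only our reading of it ($W$ samples per length, lengths at $T$ equal fractions of the video, uniformly random start frames) and may differ in detail.

```python
import numpy as np

def augment_test_set(video_len, W, T, rng):
    """Sample W random subsequences at each of T lengths from one test video.

    Lengths are taken at T equally spaced fractions of the full video length;
    start frames are drawn uniformly. Returns (start, length) pairs.
    """
    lengths = [max(1, round(video_len * (t + 1) / T)) for t in range(T)]
    subseqs = []
    for length in lengths:
        for _ in range(W):
            start = int(rng.integers(0, video_len - length + 1))
            subseqs.append((start, length))
    return subseqs

rng = np.random.default_rng(7)
print(len(augment_test_set(300, W=10, T=5, rng=rng)))  # 10 x 5 = 50 per video
```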
4.1. Comparative evaluation

In our experiments, we compared the performance of the proposed Dual-Feature Bayes (DFB) classifier against the following methods:

- Exemplar-based majority voting (MajVote) by Euclidean nearest-neighbor (NN) distance in PCA, LDA and NDMP projected spaces, with a vote taken at each frame
- Exemplar-based probabilistic voting (ProbVote) in NDMP projected space, where the NN distances are normalized to probabilities and aggregated cumulatively by the sum rule, with a vote taken at each frame
- Mutual Subspace Method (MSM)
- Kernel Principal Angles (KPA)
- Discriminative Canonical Correlations (DCC), with training for each class conducted by (i) randomly partitioning into two class-specific sets, and (ii) partitioning into M class-specific cluster sets as extracted by STHAC
- Manifold-Manifold Distance (MMD)
- Bayesian MAP classifiers: Naive Bayes (NB) and Exemplar-Driven Bayes (EDB)
MajVote, ProbVote, NB and EDB are exemplar-based methods; MSM, KPA and DCC are image set-based methods; while MMD and DFB, our proposed approach, are feature-wise fusions of both types of schemes.

The optimal parameters for each approach were determined experimentally. For the voting methods, the PCA features are set to retain 95% of the data energy, while for both the LDA and NDMP features, the dimension is set to the number of classes minus one. The exemplar training sets of all the Bayesian classifiers (NB/EDB/DFB) are projected to NDMP space, with the dimension similarly set to the number of classes minus one. The NDMP dimensionality reduction method is selected for its ability to extract meaningful discriminative features from the highly nonlinear training set manifold with complex sample variability (See and Ahmad Fauzi, 2011).

[Figure 4: Selected exemplars of two subjects from the CMU MoBo (top two rows), Honda/UCSD (bottom two rows) and ChokePoint (middle two rows) datasets.]
For the methods that require clustering during training (the voting methods, DCC with clusters, NB, EDB and DFB), the number of clusters per subject was determined empirically (details can be found in the work by See and Eswaran (2011)), i.e. M = 7 for Honda/UCSD and M = 6 for CMU MoBo and ChokePoint. In Fig. 4, sample exemplar sets extracted from training videos of the datasets show the variability of face appearances obtained using STHAC.
For MSM/KPA/DCC/DFB, the dimensionality of the PCA subspaces learned for each image set is set to 10, representing about 98% of the data energy of the set. The kernel function used in KPA is a Gaussian radial basis function (RBF) kernel, a promising choice suggested in the original authors' implementation (Wolf and Shashua, 2003). For MMD, the weighting parameter is set to 0.5 for equal weights, as suggested by Wang et al. (2008). Cluster subspaces for DFB are represented using DCC, while the temporal segment length $L$ and the subspace similarity sensitivity $\alpha$ are set to good empirical values of $L = 20$ and $\alpha = 0.75$.

Table 1: Recognition rates (%) of the evaluated methods on the three datasets, and their average across all datasets

Method            CMU MoBo   Honda/UCSD   ChokePoint   Average
MajVote-PCA          83.5        60.8         71.8       72.03
MajVote-LDA          93.6        64.5         87.1       81.73
MajVote-NDMP         94.7        68.7         92.7       85.37
ProbVote             95.8        70.8         89.9       85.50
MSM                  84.4        63.7         67.5       71.87
KPA                  69.6        77.8         68.2       71.87
DCC-RandomSplit      81.4        62.7         72.8       72.30
DCC-Clusters         86.0        74.0         86.7       82.23
NB                   95.3        75.9         91.7       87.63
EDB                  96.1        77.6         92.1       88.60
MMD                  95.3        78.4         92.2       88.63
DFB                  97.2        85.9         93.6       92.23
4.2. Results

The experimental results of the evaluated approaches (as listed in Section 4.1) on all three datasets, in terms of average recognition rate, are presented in Table 1. As observed, the proposed DFB approach obtained the best performance among all evaluated approaches on all three datasets. Its effectiveness is most prominent on the challenging Honda/UCSD dataset, and it is also distinctly superior when results are averaged across all datasets (92.23%). The performance of the image set-based approaches was noticeably poor on the Honda/UCSD and ChokePoint datasets, possibly due to their inadequacy in characterizing complex and rapidly changing head poses. Overall, the approaches that employ the described Bayesian framework (NB, EDB, DFB) yielded better performance than conventional voting schemes, and the feature fusion approaches (MMD, DFB) produced promising results. In the case of DFB, the unification of features from both video frames (points) and video segments (subspaces) using a temporally-driven classification framework provides a holistic representation for improved recognition.
4.3. Further Discussions

On closer examination of different video subsequence lengths (using T = 10), the DFB approach gradually outperformed the other approaches as the subsequence length increased, as shown in Fig. 5 for the Honda/UCSD dataset. This is a strong characteristic of our Bayesian framework, where temporally accumulated posteriors increase the accuracy of convergence towards a candidate subject. On the other hand, the performance of the image set-based approaches appeared highly erratic, and they were unable to achieve recognition rates as high as those of the Bayesian methods. Since the variability of faces increases in longer videos, characterizing entire videos as image set subspaces does not necessarily ensure that finer appearance information is well captured. Likewise, exemplar-based methods suffer from the rigidity of frame-wise classification (more so when using crude voting strategies), ignoring the usefulness of set representations at a coarser granularity. For a general summary of the one-to-many identification performance, the cumulative match curves of the evaluated methods shown in Fig. 6 demonstrate the robustness of the proposed approach across ranks.

[Figure 5: Results of selected methods using different test subsequence lengths on the Honda/UCSD dataset.]

[Figure 6: Rank-based evaluation of selected methods on the (a) CMU MoBo, (b) Honda/UCSD, and (c) ChokePoint datasets.]
Though it is possible to improve the overall recognition performance of the evaluated methods by applying state-of-the-art illumination normalization techniques in the pre-processing step (such as the one proposed by Tan and Triggs (2010)), we abide by the widely accepted practice of normalizing pixel intensities by histogram equalization across all frames in a video sequence, which is common in many notable works in the literature (Arandjelovic et al., 2005; Wang et al., 2008; Harandi et al., 2011). Likewise, this work addresses the classification aspect of the VFR problem while keeping constant the methods applied to all other steps.
Briefly, on computational cost, we report an average classification time of 0.103 s per sequence (average sequence length of 150 frames) for the DFB on Honda/UCSD (10, 5, 20), using a Pentium IV 2.8 GHz machine with 1.5 GB RAM. In comparison with the use of dual features, single-feature methods obtained mixed performance: the exemplar-feature-only methods, NB (0.078 s) and EDB (0.093 s), and the image-set-feature-only methods, MSM (0.163 s), DCC-RandomSplit (0.236 s) and DCC-Clusters (0.755 s). The DFB also fared better than MMD (0.256 s) in terms of classification speed. This is a strong indication that the proposed Bayesian framework employed for recognition remains a key factor behind its efficiency. The calculation of the similarity metrics involved is inexpensive and does not incur much computational overhead, with the exception of the cluster likelihood, which can be pre-computed prior to matching. This burden of computation is passed to the training stage, where the DFB takes 83.987 s, or about $10^2$ times the duration needed to train the EDB (0.813 s), in return for a 4% improvement in accuracy. The DFB training time is also only marginally longer than that of the DCC-Clusters approach (77.328 s). Nonetheless, it is worth noting that efficiency in the classification step (i.e., the testing stage) is ultimately important for practical applications.
5. Conclusion

We have proposed a novel dual-feature classification approach to face recognition in video that fuses an essential set of similarities derived within local appearance-based clusters. The subspace and point features characterize a diverse set of representations, whereby the learned subspaces capture intrinsic variations at the subspace level, while the selected exemplars encode the appearance information of the trained subject at the point level. Within a Bayesian maximum-a-posteriori (MAP) classification framework, a joint probability function is modelled using relevant similarity metrics to recognize the identity of subjects in video. In a comprehensive evaluation on three face video datasets (CMU MoBo, Honda/UCSD and ChokePoint), the proposed approach demonstrated promising results in comparison with conventional single-feature approaches. Our work opens up future directions towards the application of a compact Bayesian similarity-based framework to various temporally-driven data, and further exploration into different feature representations that can be leveraged.
References

Arandjelovic, O., Shakhnarovich, G., Fisher, J., Cipolla, R., Darrell, T., 2005. Face recognition with image sets using manifold density divergence. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 581-588.

Fan, W., Wang, Y., Tan, T., 2005. Video-based face recognition using Bayesian inference model. In: Proc. AVBPA. Springer-Verlag, pp. 122-130.

Fan, W., Yeung, D.-Y., 2006. Locally linear models on face appearance manifolds with application to dual-subspace based classification. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1384-1390.

Fukui, K., Yamaguchi, O., 2003. Face recognition using multi-viewpoint patterns for robot vision. In: Proc. International Symposium of Robotics Research, pp. 192-201.

Gross, R., Shi, J., 2001. The CMU Motion of Body (MoBo) database. Technical Report CMU-RI-TR-01-18, Robotics Institute, CMU.

Hadid, A., Pietikäinen, M., 2004. From still image to video-based face recognition: An experimental analysis. In: Proc. IEEE Automatic Face and Gesture Recognition, pp. 813-818.

Harandi, M.T., Sanderson, C., Shirazi, S.A., Lovell, B.C., 2011. Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2705-2712.

Kim, T., Kittler, J., Cipolla, R., 2007. Discriminative learning and recognition of image set classes using canonical correlations. IEEE Trans. PAMI 29 (6), 1005-1018.

Krüger, V., Zhou, S., 2002. Exemplar-based face recognition from video. In: Proc. European Conference on Computer Vision, pp. 732-746.

Lee, K., Ho, J., Yang, M., Kriegman, D., 2005. Visual tracking and recognition using probabilistic appearance manifolds. Computer Vision and Image Understanding 99 (3), 303-331.

Li, S.Z., Jain, A.K., 2011. Handbook of Face Recognition, 2nd Edition. Springer.

Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Chang, J., Hoffman, K., Marques, J., Min, J., Worek, W., 2005. Overview of the face recognition grand challenge. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Washington DC, USA, pp. 947-954.

Poh, N., Chan, C.H., Kittler, J., Marcel, S., McCool, C., Argones Rúa, E., Alba Castro, J.L., Villegas, M., Paredes, R., Struc, V., Pavesic, N., Salah, A.A., Fang, H., Costen, N., 2010. An evaluation of video-to-video face verification. IEEE Transactions on Information Forensics and Security 5 (4), 781-801.

Ross, D., Lim, J., Lin, R.-S., Yang, M.-H., 2008. Incremental learning for robust visual tracking. Int. Journal of Computer Vision 77 (1), 125-141.

Roweis, S., Saul, L., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323-2326.

See, J., Ahmad Fauzi, M.F., 2011. Learning neighborhood discriminative manifolds for video-based face recognition. In: Proc. International Conference on Image Analysis and Processing (ICIAP). Springer, Ravenna, Italy, pp. 247-256.

See, J., Ahmad Fauzi, M.F., Eswaran, C., 2011. Video-based face recognition using exemplar-driven Bayesian network classifier. In: Proc. Int. Conf. on Signal and Image Processing Applications. Kuala Lumpur, Malaysia, pp. 372-377.

See, J., Eswaran, C., 2011. Exemplar extraction using spatio-temporal hierarchical agglomerative clustering for face recognition in video. In: Proc. International Conference on Computer Vision (ICCV). Barcelona, Spain, pp. 1481-1486.

Shakhnarovich, G., Fisher, J., Darrell, T., 2002. Face recognition from long-term observations. In: Proc. European Conf. on Computer Vision, pp. 851-868.

Tan, X., Triggs, B., 2010. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Transactions on Image Processing 19 (6), 1635-1650.

Viola, P., Jones, M., 2001. Rapid object detection using a boosted cascade of simple features. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 511-518.

Wang, R., Shan, S., Chen, X., Gao, W., 2008. Manifold-manifold distance with application to face recognition based on image set. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition.

Wolf, L., Shashua, A., 2003. Learning over sets using kernel principal angles. Journal of Machine Learning Research 4, 913-931.

Wong, Y., Chen, S., Mau, S., Sanderson, C., Lovell, B.C., 2011. Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition. In: Computer Vision and Pattern Recognition Workshops (CVPRW).

Yamaguchi, O., Fukui, K., Maeda, K., 1998. Face recognition using temporal image sequence. In: Proc. IEEE Automatic Face and Gesture Recognition, pp. 318-323.

Zhao, W., Chellappa, R., Phillips, P., Rosenfeld, A., 2003. Face recognition: A literature survey. ACM Computing Surveys 35 (4), 399-485.

Zhou, S., Krüger, V., Chellappa, R., 2003. Probabilistic recognition of human faces from video. Computer Vision and Image Understanding 91 (1-2), 214-245.
Highlights

- We present a new dual-feature approach to face recognition in video.
- Subspace and point features within local appearance-based clusters are utilized.
- Relevant similarity metrics are formulated to model a Bayesian MAP classifier.
- We report promising results based on extensive evaluation on face video datasets.