Tracking face poses toward meeting video analysis

Ligeng Dong*, Linmi Tao, Guangyou Xu
Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China
ABSTRACT
We perform face tracking and pose estimation jointly within a mixed-state particle filter framework. Previous methods often used generative appearance models and a naive prior state transition. We propose to use discriminative models, Adaboosted face detectors, both to measure observations and to provide information for the proposal distribution, which combines detection responses with the prior transition model. Due to pose continuity, faces between discrete poses can be detected by neighboring pose-specific detectors and serve as importance samples, so continuous poses are obtained instead of discrete poses. Experiments show that our method is robust to large location and pose changes, partial occlusions and expressions.
Keywords: Face tracking, pose estimation, particle filter
1. INTRODUCTION
Face poses are important cues to infer people's visual focus of attention in meetings. Traditional face pose estimation is based on 3D models, where high-resolution face images are used. However, meeting videos are usually recorded by distant cameras, and only low-resolution faces are captured, so we prefer 2D appearance-based methods. There are two major issues in face pose tracking: the modeling of face poses and the framework of tracking.
In order to model the large appearance variation of faces due to pose changes, pose-specific models are usually needed. Several pose-specific face models have been applied in face pose tracking. They can be roughly divided into two types. One type is generative models such as PCA [5] and exemplar-based methods [1]. The major disadvantage of generative models is the heavy computational load, which is not suitable for real-time applications. The other type is discriminative models such as NN [10] and SVM [11]. Currently, Adaboost [13] is the most efficient discriminative method for face detection. A five-pose face detector based on Adaboost was integrated for head tracking in [7]; however, that work focused on head tracking rather than pose estimation. In our work, 9 poses are defined from -90° to 90° with a step of 22.5°. We propose to model the faces of each pose by a specific face detector. In order to better describe non-frontal faces, Asymmetric Rectangle Features [12] are used to train the cascade of classifiers. Our multiple pose-specific face detectors are different from other multi-view face detection methods [3]. Previous methods usually classify one face patch into a fixed pose, while in our method one face patch may be classified as a face by more than one pose-specific face detector. This is because pose change is continuous and smooth, and the border between two discrete neighboring poses is not very clear, so it is reasonable for a face patch to be classified as both poses if its true pose lies between two discrete neighboring poses.
In a traditional face pose tracking application, face tracking and pose estimation are commonly organized sequentially, so pose estimation depends on the results of the face tracker. A badly aligned face box may thus result in wrong pose estimation. To tackle this problem, face tracking and pose estimation should be considered simultaneously. Lee et al. [5] presented a way of performing face tracking and pose subspace recognition iteratively. Like [1], we use a mixed-state particle filter [6] to couple face tracking and pose estimation in a probabilistic framework.
In the standard particle filter [4][2], the naive transition prior is used as the proposal distribution without considering recent observations, so many particles may be wasted in low-likelihood areas. Proposal distributions that integrate recent observations can draw good samples and perform better than the naive transition prior [9]. Okuma et al. [8] trained a cascade of classifiers for hockey players to serve as a good proposal distribution rather than an observation model. Inspired by [8], we propose to construct proposal distributions by combining the detection responses of pose-specific face detectors with the transition prior. In our work, the detectors are also used as the observation model.
* dongligeng99@mails.thu.edu.cn; phone +86 10 62797002-804; fax +86 10 62781118; http://media.cs.tsinghua.edu.cn
In this paper, we propose multiple pose-specific face detectors based on Adaboost to model the face appearance variation of different poses, within a mixed-state particle filter framework that performs face tracking and pose estimation simultaneously. Responses from neighboring face detectors are combined with the transition prior to obtain a better proposal distribution. Each face detector is trained with Asymmetric Rectangle Features (ARFs) to better describe profile faces.
The rest of this paper is organized as follows. In Section 2, the multiple pose-specific face models using ARFs are presented. Section 3 describes the mixed-state particle filter framework with importance sampling from our proposal distribution. Experimental results are reported in Section 4 and conclusions are drawn in Section 5.
2. MULTI-POSE APPEARANCE MODELS
We use multiple pose-specific face detectors as our appearance models. In order to better describe non-frontal faces, Asymmetric Rectangle Features (ARFs) [12] are used to train the cascade of classifiers for each pose. The rectangle feature set includes 3 types of rectangle features (see Figure 1).
Fig.1. Asymmetric features in profile faces and the rectangle feature set. The left figure illustrates the asymmetric features and the right figure illustrates the three types of asymmetric rectangle features we adopted.
Different structures for multi-view face detection are discussed in [3]. These methods detect and classify a face sample into a definite pose, where only five discrete poses are defined. However, our point is that in a video stream, face poses may change continuously. For a face between two neighboring discrete poses, it is very hard to tell which pose class the face belongs to, since it is similar to both poses. In this light, it is reasonable to set the pose of such a face to a pose between the two neighboring discrete poses. Thus, we adopt multiple pose-specific face detectors as our appearance models. Figure 2 illustrates the structure of our face detectors.
Fig.2. Structure of multiple pose-specific face detectors. Each face detector will detect faces of poses within a certain range.
For each pose, a cascade of classifiers is trained individually using the Adaboost algorithm, so we obtain M pose-specific face detectors (M is the number of poses; M=9 in this study). Wu et al. [13] also trained different cascades for each pose. In their scheme, an input window passes the first three layers of all the detectors, the pose is then estimated by selecting the one with the highest confidence, and afterwards the input window goes through all the remaining layers of that detector. Our detection scheme differs from [13] in that the input window may go through all the layers of every possible pose-specific face model before it is rejected as non-face. Since pose change is continuous and smooth, and the border between two discrete neighboring poses is not very clear, it is quite possible and reasonable for an input window to pass all the layers of more than one detector, thus being detected as faces of different poses. It is also very common that, for a face whose true pose lies between two discrete neighboring poses, both neighboring pose-specific face detectors have responses around the face area.
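As a toy illustration of this multi-detector scheme, the acceptance logic can be sketched as follows. The per-layer scores and thresholds are hypothetical stand-ins for trained cascade outputs, not the authors' detectors; the point is only that one window may be accepted by several neighboring pose-specific cascades.

```python
def detect_poses(window_scores, thresholds):
    """Return the list of pose indices whose cascade fully accepts the window.

    window_scores[k][n] is the confidence the window gets at layer n of the
    k-th pose-specific cascade; thresholds[k][n] is that layer's threshold.
    A window is accepted by pose k only if it passes every layer.
    """
    accepted = []
    for k, (scores, ths) in enumerate(zip(window_scores, thresholds)):
        if all(s >= t for s, t in zip(scores, ths)):
            accepted.append(k)
    return accepted

# A face midway between pose 3 and pose 4 can pass both cascades,
# so both pose labels are returned (illustrative two-layer cascades).
scores = [[0.1], [0.1], [0.1], [0.9, 0.8], [0.9, 0.7], [0.1], [0.1], [0.1], [0.1]]
ths    = [[0.5], [0.5], [0.5], [0.5, 0.5], [0.5, 0.5], [0.5], [0.5], [0.5], [0.5]]
print(detect_poses(scores, ths))  # -> [3, 4]
```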
3. A MIXED-STATE PARTICLE FILTER
A mixed-state particle filter is adopted in which the object state contains continuous spatial parameters and a discrete pose parameter. This particle filter is then boosted by using the pose-specific face detectors described above to construct the proposal distribution (importance sampling function). In the following subsections, the related elements of this boosted mixed-state particle filter are discussed.
3.1 Mixed state space and observation model
The state of the particle filter is a mixed variable X = (x, k). The continuous variable x is defined as x = (t_x, t_y, s), where (t_x, t_y) specifies the center of the face square and s specifies the side length of the face square. The discrete variable k (k = 1, ..., M) specifies the pose index to which the current observation belongs, where M is the pose number.
Suppose the input image is I, and I(x) is the image patch extracted from I according to the spatial parameters x. Then the likelihood p(I | X) is modeled as

    p(I | X) = p(I | x, k) = p(I(x) | k)    (1)

To calculate p(I(x) | k), the probability of I(x) belonging to the k-th pose-specific face model, we adopt the likelihood function proposed in [6]. In this study, each pose-specific face model is a cascade of classifiers trained as in Section 2. Suppose the total number of layers in the k-th model's cascade is N_k, and n_k is the maximum layer that an input window has passed. For simplicity, we assume that the likelihood of the input window belonging to the model is related to n_k / N_k. Specifically, the likelihood is defined as

    p(I | X) = p(I(x) | k) = (1/Z) exp( -(1 - n_k/N_k)² / (2σ_k²) )    (2)

where σ_k is the standard deviation of model k and Z is the normalization term.
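A minimal sketch of the layer-based likelihood in Eq. (2): the further a window travels through the k-th cascade (n_k of N_k layers), the higher its likelihood. The σ_k value and the normalization Z below are illustrative, not the trained parameters.

```python
import math

def cascade_likelihood(n_k, N_k, sigma_k, Z=1.0):
    """p(I(x)|k) = (1/Z) * exp(-(1 - n_k/N_k)**2 / (2 * sigma_k**2))."""
    r = 1.0 - n_k / N_k
    return math.exp(-(r * r) / (2.0 * sigma_k ** 2)) / Z

# A window that passes all layers gets the maximal likelihood 1/Z;
# one rejected halfway gets a much smaller value.
full = cascade_likelihood(n_k=20, N_k=20, sigma_k=0.3)
half = cascade_likelihood(n_k=10, N_k=20, sigma_k=0.3)
print(full, half)  # full == 1.0, and full > half
```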
3.2 Dynamical model
The dynamics of the state are modeled as a first-order Markov process p(X_t | X_{t-1}). For the mixed state, we assume that the two components of the state are independent, and that at time t the face pose k_t depends only on the pose k_{t-1} at the previous time. The transition density is then

    p(X_t | X_{t-1}) = p(x_t, k_t | x_{t-1}, k_{t-1}) = p(k_t | k_{t-1}) p(x_t | x_{t-1})    (3)

The dynamics of the continuous variable x are modeled as zero-order Gaussian diffusion. Considering the continuity of pose transitions, the dynamics of the discrete variable k are described as

    p(k_t = i | k_{t-1} = j) = 0,    if |i - j| > 1
    p(k_t = i | k_{t-1} = j) = c_1,  if |i - j| = 1    (4)
    p(k_t = i | k_{t-1} = j) = c_2,  if i = j

with Σ_i p(k_t = i | k_{t-1} = j) = 1. Usually c_2 is bigger than c_1, since one often stays at one fixed pose and changes to other poses only occasionally (e.g., imagine the situation where one is talking with different people).
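The pose transition of Eq. (4) can be sketched as a transition row over the M = 9 poses; the c_1 and c_2 values below are illustrative, and the row is renormalized at the border poses where only one neighbor exists (an assumption, since the paper does not spell out the border case).

```python
def pose_transition_row(j, M=9, c1=0.15, c2=0.7):
    """Return p(k_t = i | k_{t-1} = j) for i = 0..M-1 (0-based indices)."""
    row = [0.0] * M
    row[j] = c2                       # stay at the same pose: c2
    for i in (j - 1, j + 1):
        if 0 <= i < M:
            row[i] = c1               # move to an existing neighbor: c1
    total = sum(row)                  # renormalize so the row sums to 1
    return [p / total for p in row]

row = pose_transition_row(4)
print(row)  # mass only on poses 3, 4, 5; staying put is most likely
```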
3.3 Objective function of face pose tracking
For face pose tracking in a video sequence {I_t, t = 1, 2, ...}, the goal of tracking at time t is to solve the MAP problem:

    X̂_t = argmax_{X_t} p(X_t | I_{1:t})    (5)

Following the Bayesian rule, the posterior probability of X_t is inferred over time as

    p(X_t | I_{1:t}) ∝ p(I_t(x_t) | k_t) Σ_{k_{t-1}=1}^{M} ∫ p(k_t | k_{t-1}) p(x_t | x_{t-1}) p(x_{t-1}, k_{t-1} | I_{1:t-1}) dx_{t-1}    (6)

Suppose each pose model is associated with a pose angle {θ_k : k = 1, ..., M}. The estimated pose at time t is then

    θ̂_t = Σ_k θ_k p(k_t = k | I_{1:t})    (7)
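Eq. (7) is a simple expectation of the discrete pose angles under the pose posterior, which is what yields a continuous pose estimate from 9 discrete models. A sketch with illustrative posterior values:

```python
# The 9 discrete pose angles of the paper: -90 to +90 deg, step 22.5.
angles = [-90 + 22.5 * k for k in range(9)]

def weighted_pose(posterior, angles):
    """theta_hat = sum_k theta_k * p(k_t = k | I_{1:t})."""
    return sum(th * p for th, p in zip(angles, posterior))

# A posterior split evenly between the 22.5 deg and 45 deg models
# gives a continuous estimate between them.
posterior = [0, 0, 0, 0, 0, 0.5, 0.5, 0, 0]
print(weighted_pose(posterior, angles))  # -> 33.75
```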
3.4 Importance sampling function
We integrate the pose-specific face detection results to construct the proposal distribution. For each image at time t, after performing face detection with our pose-specific face detectors, there may be more than one face response with different pose labels. Suppose X_t^l = (x_t^l, k_t^l) is the l-th detection response at time t, where x_t^l denotes the spatial parameters of a face response and k_t^l denotes the corresponding pose model. The importance function I(X_t | X_{t-1}, I_t) is thus approximated as a mixture of Dirac functions Σ_l δ(X - X_t^l). Our proposal distribution is a mixture of the importance function from the face detection results and the prior transition density, as follows:

    q(X_t | X_{0:t-1}, I_{1:t}) = α I(X_t | X_{t-1}, I_t) + (1 - α) p(X_t | X_{t-1})    (8)

where α is the parameter balancing the two components. The parameter α can be adapted dynamically according to the detection results. If there is no detection response due to noise, we can set α = 0, and the proposal distribution reduces to the prior transition distribution. If there are many responses, we increase α so that more importance is placed on the detection results. In the case of large occlusions or abrupt location and pose changes, a lost track is declared if the accumulated face likelihoods fall below a threshold; the track is then reinitialized by using the face detectors to search over larger areas and poses in the following frames.
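Sampling from the mixture proposal of Eq. (8) can be sketched as follows: with probability α a particle is drawn from the detection responses (the Dirac-mixture importance function), otherwise it is diffused from the previous state via the transition prior. The state tuples, `diffuse` noise levels and detection lists are illustrative stand-ins, not the paper's trained components.

```python
import random

def sample_proposal(prev_state, detections, alpha, diffuse):
    """Draw one mixed state (x, k) from q = alpha*I + (1-alpha)*p."""
    if detections and random.random() < alpha:
        # importance function: a mixture of Diracs on detection responses
        return random.choice(detections), True
    return diffuse(prev_state), False

def diffuse(state):
    (tx, ty, s), k = state
    # zero-order Gaussian diffusion on the spatial part; the discrete
    # pose transition of Eq. (4) is omitted in this sketch
    return (tx + random.gauss(0, 2), ty + random.gauss(0, 2), s), k

random.seed(0)
prev = ((100.0, 80.0, 24.0), 4)
dets = [((103.0, 79.0, 25.0), 4), ((98.0, 81.0, 24.0), 5)]
s1, used_det = sample_proposal(prev, dets, alpha=1.0, diffuse=diffuse)
s2, used_prior = sample_proposal(prev, [], alpha=1.0, diffuse=diffuse)
print(used_det, s1 in dets)  # detections are used when alpha is high
print(used_prior)            # no responses -> fall back to the prior
```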
3.5 Outline of our algorithm
Generate {x_t^i, k_t^i, ω_t^i} from {x_{t-1}^i, k_{t-1}^i, ω_{t-1}^i}:
1. Compute the importance function I(X_t | X_{t-1}, I_t).
2. Resampling. Resample {x_{t-1}^i, k_{t-1}^i} to get {x'_{t-1}^i, k'_{t-1}^i}.
3. Prediction. For each {x'_{t-1}^i, k'_{t-1}^i}, generate a uniformly distributed number β ∈ [0, 1].
   (a) If β < α, sample (x_t^i, k_t^i) from I(X_t | X_{t-1}, I_t) and set the correction factor λ_t^i as

       λ_t^i = [ (1/N) Σ_{j=1}^{N} p(X_t = (x_t^i, k_t^i) | X_{t-1} = (x'_{t-1}^j, k'_{t-1}^j)) ] / I(X_t = (x_t^i, k_t^i) | X_{t-1}, I_t)

   (b) If β ≥ α, sample (x_t^i, k_t^i) from p(X_t | X_{t-1}) and set λ_t^i = 1.
4. Measurement. Compute the likelihood p(I_t | x_t^i, k_t^i) and update the weight of each particle as ω_t^i = λ_t^i p(I_t | x_t^i, k_t^i). Then normalize the weights so that Σ_i ω_t^i = 1.
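The control flow of steps 1-4 can be sketched end to end on a simplified one-dimensional state (x, k). The likelihood, transition density and detector responses below are toy stand-ins for the trained cascades and the Gaussian diffusion of Section 3.2, so this shows only the structure of the boosted filter, not the authors' implementation.

```python
import math
import random

M, N, ALPHA = 9, 50, 0.5  # poses, particles, proposal mixing weight

def transition_pdf(x, k, px, pk, sx=2.0):
    """Toy p(X_t | X_{t-1}): pose term (Eq. 4 style) times spatial Gaussian."""
    pk_prob = {0: 0.7, 1: 0.15}.get(abs(k - pk), 0.0)
    gx = math.exp(-(x - px) ** 2 / (2 * sx * sx)) / (sx * math.sqrt(2 * math.pi))
    return pk_prob * gx

def pf_step(particles, weights, detections, likelihood):
    # step 2: resample according to the old weights
    resampled = random.choices(particles, weights=weights, k=N)
    new_particles, lambdas = [], []
    for (px, pk) in resampled:
        # step 3: mix detector proposals with the transition prior
        if detections and random.random() < ALPHA:
            x, k = random.choice(detections)            # importance sample
            num = sum(transition_pdf(x, k, qx, qk) for qx, qk in resampled) / N
            lam = num / (1.0 / len(detections))         # correction factor
        else:
            x, k = px + random.gauss(0, 2.0), pk        # prior diffusion
            lam = 1.0
        new_particles.append((x, k))
        lambdas.append(lam)
    # step 4: weight by likelihood and normalize
    w = [lam * likelihood(x, k) for lam, (x, k) in zip(lambdas, new_particles)]
    total = sum(w)
    return new_particles, [wi / total for wi in w]

random.seed(1)
true_x, true_k = 30.0, 5
lik = lambda x, k: math.exp(-(x - true_x) ** 2 / 50.0) * (1.0 if k == true_k else 0.2)
parts = [(random.uniform(0, 60), random.randrange(M)) for _ in range(N)]
ws = [1.0 / N] * N
for _ in range(5):
    parts, ws = pf_step(parts, ws, [(true_x, true_k)], lik)
est_x = sum(w * x for w, (x, k) in zip(ws, parts))
print(round(est_x, 1))  # the weighted estimate moves toward the target
```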
4. EXPERIMENTS
4.1 Data collection
In our experiments, we use 9 pose models. Each pose model represents a rotation angle, ranging from -90 to +90 degrees with a step of 22.5 degrees. Figure 3 shows an example of the 9 poses in our experiment.
Fig.3. Our face models of the 9 poses, which range from -90 to +90 degrees with a step of 22.5 degrees.
The data is collected as follows. We put markers on the wall to indicate the directions of the 9 poses. The person sits in the center of the room and is asked to look at each marker successively by moving his head. The whole process is recorded as a video. The person stops at each marker for a little while so that we know the pose in the video. Training data was collected by manually selecting images around the defined poses, so that our multiple pose-specific face models can represent faces with continuous poses. For example, the image set for pose 0° may contain images with poses from -10° to 10°. The negative sample set contains background images, images with only half a face, and the positive patterns of non-neighboring poses. For example, the positive samples of pose 0° are negative samples of the 6 non-neighboring poses [-90°, -67.5°, -45°, 45°, 67.5°, 90°]. Figure 4 shows the training samples of pose 0°. We collected training data from 10 individuals.
Fig.4. Training data of pose 0°. The left are positive samples; the right are negative samples. The sample size is 24*24.
4.2 Face pose tracking
We applied our method to videos of people in the training set. Figure 5(a) shows some tracking results of a sequence with intensive pose and location changes. Figure 5(b) shows the estimated weighted poses. The track is lost in some frames with full occlusion but is reinitialized when the face reappears.
We also tested the method on people not in the training set. Figure 6 shows the results for a real meeting discussion video, in which the person is talking with 3 other people. The person shakes his head to show his disagreement with others, which causes the vibration of the pose numbers in Figure 6(b). Experiments show that our method is robust to expression changes, partial occlusions and small in-plane head tilt and rotation.
Fig.5. Tracking results of faces with location change and large pose variations.
Fig.6. Tracking results of faces in a group conversation with head shakes and head nods.
5. CONCLUSION AND FUTURE WORK
We proposed to address the problem of face pose tracking within a mixed-state particle filter. We combine the face detection results and the prior transition model to construct the proposal distribution, which helps generate good samples. Our method obtains continuous face pose tracking results and is robust to large pose changes, partial occlusions and expressions. In future work, the face poses will be used for recognizing the visual focus of attention and head gestures of participants in meeting video analysis.
ACKNOWLEDGEMENTS
This work is supported by the National Natural Science Foundation of China under grant No. 60673189.
REFERENCES
1. Ba, S.O., Odobez, J.M.: Head pose tracking and focus of attention recognition algorithms in meeting rooms. CLEAR (2006)
2. Doucet, A., de Freitas, J.F.G., Gordon, N.J. (eds.): Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York (2001)
3. Huang, C., Ai, H., Li, Y., Lao, S.: Vector boosting for rotation invariant multi-view face detection. In: ICCV. (2005)
4. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. International Journal on Computer Vision, 28(1):5-28 (1998)
5. Lee, K.C., Ho, J., Yang, M.H., Kriegman, D.: Video-Based Face Recognition Using Probabilistic Appearance Manifolds. In: CVPR. (2003)
6. Li, P., Wang, H.: Probabilistic Face Tracking Using Boosted Multi-view Detector. PCM (2004)
7. Li, Y., Ai, H., Huang, C., Lao, S.: Robust Head Tracking Based on a Multi-State Particle Filter. In: FG2006, pp. 335-340, Southampton, UK, April 10-12 (2006)
8. Okuma, K., et al.: A boosted particle filter: Multitarget detection and tracking. In: ECCV. (2004)
9. Rui, Y., Chen, Y.: Better Proposal Distributions: Object Tracking Using Unscented Particle Filter. IEEE Conference on Computer Vision and Pattern Recognition, pp. 786-793 (2001)
10. Voit, M., Nickel, K., Stiefelhagen, R.: Multi-view Head Pose Estimation using Neural Networks. In: Computer and Robot Vision. (2005)
11. Wang, P., Ji, Q.: Multi-View Face Tracking with Factorial and Switching HMM. In: Proc. 7th IEEE Workshops on Application of Computer Vision. (2005)
12. Wang, Y., Liu, Y., Tao, L., Xu, G.: Real-Time Multi-View Face Detection and Pose Estimation in Video Stream. In: ICPR 2006, vol. 4, pp. 354-357, 20-24 Aug. (2006)
13. Wu, B., Ai, H., Huang, C., Lao, S.: Fast rotation invariant multi-view face detection based on real adaboost. In: Intl. Conf. on Automatic Face and Gesture Recognition. (2004)