Tracking face poses toward meeting video analysis



Ligeng Dong*, Linmi Tao, Guangyou Xu

Department of Computer Science and Technology, Tsinghua University, Beijing, P.R.China

ABSTRACT

We perform face tracking and pose estimation jointly within a mixed-state particle filter framework. Previous methods often used generative appearance models and a naive prior state transition. We propose to use discriminating models, Adaboosted face detectors, both to measure observations and to provide information for the proposal distribution, which combines detection responses with the prior transition model. Due to pose continuity, faces between discrete poses can be detected by neighboring pose-specific detectors and serve as importance samples; thus continuous poses are obtained instead of discrete poses. Experiments show that our method is robust to large location and pose changes, partial occlusions and expressions.

Keywords: Face tracking, pose estimation, particle filter


1. INTRODUCTION

Face poses are important cues to infer people's visual focus of attention in meetings. Traditional face pose estimation is based on 3D models, where high-resolution face images are used. However, meeting videos are usually recorded by distant cameras and only low-resolution faces are captured, so we prefer 2D appearance-based methods. There are two major issues in face pose tracking: the modeling of face poses and the framework of tracking.

In order to model the large appearance variation of faces due to pose changes, pose-specific models are usually needed. Several pose-specific face models have been applied in face pose tracking. They can be divided roughly into two types. One type is generative models such as PCA [5] and exemplar-based methods [1]. The major disadvantage of generative models is their heavy computational load, which is not suitable for real-time applications. The other type is discriminating models like NN [10] and SVM [11]. Currently, Adaboost [13] is the most efficient discriminating method for face detection. A five-pose face detector based on Adaboost was integrated for head tracking in [7]; however, that work focused on head tracking rather than pose estimation. In our work, 9 poses are defined from -90° to 90° with a step of 22.5°. We propose to model faces of each pose by a specific face detector. In order to better describe non-frontal faces, Asymmetric Rectangle Features [12] are used to train the cascade of classifiers. Our multiple pose-specific face detectors differ from other multi-view face detection methods [3]. Previous methods usually classify one face patch to a fixed pose, while in our method one face patch may be classified as a face by more than one pose-specific face detector. This is because pose change is continuous and smooth, and the border between two discrete neighboring poses is not very clear, so it is reasonable for a face patch to be classified as both poses if its true pose lies between two discrete neighboring poses.

In a traditional face pose tracking application, face tracking and pose estimation are commonly organized sequentially, so pose estimation depends on the results of the face tracker. Thus a badly aligned face box may result in a wrong pose estimate. To tackle this problem, face tracking and pose estimation should be considered simultaneously. Lee et al. [5] presented a way of performing face tracking and pose subspace recognition iteratively. Like [1], we use a mixed-state particle filter [6] to couple face tracking and pose estimation in a probabilistic framework.

In the standard particle filter [4][2], the naive transition prior is used as the proposal distribution without considering recent observations, so many particles may be wasted in low-likelihood areas. Proposal distributions that integrate recent observations can draw good samples and perform better than the naive transition prior [9]. Okuma et al. [8] trained a cascade of classifiers for hockey players to serve as a good proposal distribution rather than as an observation model. Inspired by [8], we propose to construct proposal distributions by combining the detection responses of pose-specific face detectors with the transition prior. In our work, the detectors are also used as the observation model.




* dongligeng99@mails.thu.edu.cn; phone +86 10 62797002-804; fax +86 10 62781118; http://media.cs.tsinghua.edu.cn





In this paper, within a mixed-state particle filter framework that performs face tracking and pose estimation simultaneously, we propose multiple pose-specific face detectors based on Adaboost to model the face appearance variation across poses. Responses from neighboring face detectors are combined with the transition prior to obtain a better proposal distribution. Each face detector is trained with Asymmetric Rectangle Features (ARFs) to better describe profile faces.

The rest of this paper is organized as follows. In Section 2, the multiple pose-specific face models using ARFs are presented. Section 3 describes the mixed-state particle filter framework with importance sampling from our proposal distribution. Experimental results are reported in Section 4 and conclusions are drawn in Section 5.

2. MULTI-POSE APPEARANCE MODELS

We use multiple pose-specific face detectors as our appearance models. In order to better describe non-frontal faces, Asymmetric Rectangle Features (ARFs) [12] are used to train the cascade of classifiers for each pose. The rectangle feature set includes 3 types of rectangle features (see Figure 1).

Fig.1. Asymmetric features in profile faces and the rectangle feature set. The left figure illustrates the asymmetric features and the right figure illustrates the three types of asymmetric rectangle features we adopted.

Different structures for multi-view face detection are discussed in [3]. These methods detect and classify a face sample to a definite pose, where only five discrete poses are defined. However, our point is that in a video stream, face poses may change continuously. For a face between two neighboring discrete poses, it is very hard to tell which pose class the face belongs to, since it is similar to both poses. In this light, it is reasonable to set the pose of this face to a pose between the two neighboring discrete poses. Thus, we adopt multiple pose-specific face detectors as our appearance models. Figure 2 illustrates the structure of our face detectors.

Fig.2. Structure of multiple pose-specific face detectors. Each face detector detects faces of poses within a certain range.

For each pose, a cascade of classifiers is trained individually using the Adaboost algorithm, so we obtain M pose-specific face detectors (M is the number of poses; M = 9 in this study). Wu et al. [13] also trained different cascades for each pose. During detection, an input window passes the first three layers of all the detectors; the pose is then estimated by selecting the detector with the highest confidence, and afterwards the input window goes through all the remaining layers. Our detection scheme differs from [13] in that the input window may go through all the layers of every possible pose-specific face model before it is rejected as non-face. Since pose change is continuous and smooth, and the border between two discrete neighboring poses is not very clear, it is quite possible and reasonable that an input window passes all the layers of more than one detector, thus being detected as faces of different poses. It is also very common that for a face whose true pose is between two discrete neighboring poses, both neighboring pose-specific face detectors have responses around the face area.
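To make this multi-detector scheme concrete, here is a minimal sketch. It assumes each cascade is simply a list of per-layer accept/reject functions; the names `layers_passed` and `detect_poses` are illustrative, not from the paper's implementation.

```python
def layers_passed(window, cascade):
    # Count how many consecutive layers of one cascade the window passes.
    n = 0
    for layer in cascade:
        if not layer(window):
            break
        n += 1
    return n

def detect_poses(window, cascades):
    # Indices of every pose-specific cascade that fully accepts the window;
    # a face between two trained poses may be accepted by more than one.
    return [k for k, c in enumerate(cascades)
            if layers_passed(window, c) == len(c)]
```

A window whose appearance lies between two trained poses can thus carry two pose labels at once, which is exactly what the importance sampling in Section 3 exploits.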

3. A MIXED-STATE PARTICLE FILTER

A mixed-state particle filter is adopted, where the object state contains continuous spatial parameters and a discrete pose parameter. This particle filter is then boosted by using the pose-specific face detectors mentioned above to construct the proposal distribution (importance sampling function). In the following subsections, the related elements of this boosted mixed-state particle filter are discussed.

3.1 Mixed state space and observation model

The state of the particle filter is a mixed variable $X = (x, k)$. The continuous variable $x$ is defined as $x = (t_x, t_y, s)$, where $(t_x, t_y)$ specifies the center of the face square and $s$ specifies the side length of the face square. The discrete variable $k$ ($k = 1, \ldots, M$) specifies the pose index to which the current observation belongs, where $M$ is the pose number.

Suppose the input image is $I$, and $I(x)$ is the image patch extracted from $I$ according to the spatial parameters $x$; then the likelihood $p(I \mid X)$ is modeled as

$$p(I \mid X) = p(I \mid x, k) = p(I(x) \mid k) \qquad (1)$$

To calculate $p(I(x) \mid k)$, the probability of $I(x)$ belonging to the $k$th pose-specific face model, we adopt the likelihood function proposed in [6]. In this study, each pose-specific face model is a cascade of classifiers trained in Section 2. Suppose the total number of layers in the $k$th model's cascade is $N_k$, and $n_k$ is the maximum layer that an input window has passed. For simplicity, we assume that the likelihood of the input window belonging to the model is related to $n_k / N_k$. Specifically, the likelihood is defined as

$$p(I \mid X) = p(I(x) \mid k) \propto \exp\left(-\frac{(1 - n_k / N_k)^2}{2\sigma_k^2}\right) \qquad (2)$$

where $\sigma_k$ is the standard deviation of model $k$ (the normalization term is absorbed into the proportionality).
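Eq. (2) is a one-liner in practice. The sketch below assumes the layer count and cascade depth are already available from the detection pass; `cascade_likelihood` is an illustrative name.

```python
import math

def cascade_likelihood(n_k, N_k, sigma_k):
    # Unnormalized likelihood of Eq. (2): the deeper the window penetrates
    # the k-th cascade (n_k of N_k layers), the larger the likelihood,
    # reaching its maximum when all layers are passed (n_k == N_k).
    return math.exp(-((1.0 - n_k / N_k) ** 2) / (2.0 * sigma_k ** 2))
```

A fully accepted window (n_k = N_k) scores 1, and the score decays smoothly as the window is rejected earlier in the cascade.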

3.2 Dynamical model

The dynamics of the state are modeled as a first-order Markov process $p(X_t \mid X_{t-1})$. For the mixed state, we assume that the two components of the state are independent and that at time $t$ the face pose $k_t$ depends only on the pose $k_{t-1}$ at the previous time. The transition density is then

$$p(X_t \mid X_{t-1}) = p(x_t, k_t \mid x_{t-1}, k_{t-1}) = p(k_t \mid k_{t-1})\, p(x_t \mid x_{t-1}) \qquad (3)$$

The dynamics of the continuous variable $x$ are modeled as zero-order Gaussian diffusion. Considering the continuous pose transition, the dynamics of the discrete variable $k$ are described as

$$p(k_t = i \mid k_{t-1} = j) = \begin{cases} 0, & |i - j| > 1 \\ c_1, & |i - j| = 1 \\ c_2, & i = j \end{cases} \qquad (4)$$

with $\sum_i p(k_t = i \mid k_{t-1} = j) = 1$. Usually $c_2$ is bigger than $c_1$, since one often stays at one fixed pose and changes to other poses only occasionally (e.g., imagine the situation where one is talking with different people).
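The discrete dynamics of Eq. (4) can be tabulated once as a transition matrix. This is a sketch under one stated assumption: boundary poses have only one neighbor, so each row is renormalized to satisfy the sum-to-one constraint; the function name is illustrative.

```python
def pose_transition_matrix(M, c1, c2):
    # Row-stochastic p(k_t = i | k_{t-1} = j) from Eq. (4): weight c2 for
    # staying at the same pose, c1 for moving to an adjacent pose, and
    # zero beyond one step. Rows are renormalized so the boundary poses
    # (which have a single neighbor) still sum to 1.
    P = [[0.0] * M for _ in range(M)]
    for j in range(M):
        for i in range(M):
            if i == j:
                P[j][i] = c2
            elif abs(i - j) == 1:
                P[j][i] = c1
        s = sum(P[j])
        P[j] = [p / s for p in P[j]]
    return P
```

With M = 9 and c2 > c1, a particle tends to keep its pose label and only drifts to a neighboring pose occasionally, matching the smooth head motion the model assumes.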





3.3 Objective function of face pose tracking


For face pose tracking in a video sequence $\{I_t, t = 1, 2, \ldots\}$, at time $t$ the goal of tracking is to solve the MAP problem:

$$\hat{X}_t = \arg\max_{X_t} p(X_t \mid I_{1:t}) \qquad (5)$$

Following the Bayesian rule, the posterior probability of $X_t$ is inferred over time as

$$p(X_t \mid I_{1:t}) \propto p(I_t(x_t) \mid k_t) \sum_{k_{t-1}=1}^{M} \int_{x_{t-1}} p(k_t \mid k_{t-1})\, p(x_t \mid x_{t-1})\, p(x_{t-1}, k_{t-1} \mid I_{1:t-1})\, dx_{t-1} \qquad (6)$$

Suppose each pose model is associated with a pose angle $\{\theta_k : k\}$; the estimated pose at time $t$ is

$$\hat{\theta}_t = \sum_k \theta_k\, p(k_t = k \mid I_{1:t}) \qquad (7)$$
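Eq. (7) is what turns the 9 discrete detectors into a continuous pose output. A minimal sketch (the function name is illustrative; the posterior would come from the normalized particle weights):

```python
def expected_pose(angles, pose_posterior):
    # Eq. (7): posterior-weighted mean of the discrete pose angles.
    # Mass split across two neighboring detectors yields an angle
    # between them, i.e. a continuous pose estimate.
    return sum(a * p for a, p in zip(angles, pose_posterior))
```

For instance, with the paper's 9 angles from -90° to 90° in 22.5° steps, equal posterior mass on the 0° and 22.5° models gives an estimate of 11.25°, a pose that none of the discrete detectors represents on its own.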

3.4 Importance sampling function

We integrate the pose-specific face detection results to construct the proposal distribution. For each image at time $t$, after performing face detection with our pose-specific face detectors, there may be more than one face response with different pose labels. Suppose $X_t^l = (x_t^l, k_t^l)$ is the $l$th detection response at time $t$, where $x_t^l$ denotes the spatial parameters of a face response and $k_t^l$ denotes the corresponding pose model. The importance function $I(X_t \mid X_{t-1}, I_t)$ is thus approximated as a mixture of Dirac functions $\sum_l \delta(X - X_t^l)$. Our proposal distribution is a mixture of the importance function from the face detection results and the prior transition density, as follows:

$$q(X_t \mid X_{0:t-1}, I_{1:t}) = \alpha\, I(X_t \mid X_{t-1}, I_t) + (1 - \alpha)\, p(X_t \mid X_{t-1}) \qquad (8)$$

where $\alpha$ is the parameter balancing the two components.

The parameter $\alpha$ can be adapted dynamically according to the detection results. If there is no detection response due to noise, we can set $\alpha = 0$, so that the proposal distribution is the prior transition distribution. If there are many responses, we increase $\alpha$ so that more importance is placed on the detection results.
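Sampling from the Eq. (8) mixture reduces to a coin flip per particle. A sketch under stated assumptions: `detections` is a list of $(x, k)$ detector responses standing in for the Dirac part, and `prior_sampler` is any callable drawing from the prior transition; both names are illustrative.

```python
import random

def sample_from_proposal(alpha, detections, prior_sampler, rng=random):
    # Draw one state from the Eq. (8) mixture: with probability alpha,
    # pick one of the detection responses (the importance function),
    # otherwise fall back to the prior transition model. When there are
    # no detections this degenerates to alpha = 0, as in the paper.
    if detections and rng.random() < alpha:
        return rng.choice(detections)
    return prior_sampler()
```

This also shows why adapting alpha is cheap: it only changes the mixing probability, not the samplers themselves.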

In the case of large occlusions or abrupt location and pose changes, a lost track is declared if the accumulated face likelihoods fall below a threshold; the track is then reinitialized by using the face detectors to search over larger areas and pose ranges in the following frames.

3.5 Outline of our algorithm

Generate $\{x_t^i, k_t^i, \omega_t^i\}$ from $\{x_{t-1}^i, k_{t-1}^i, \omega_{t-1}^i\}$:

1. Compute the importance function $I(X_t \mid X_{t-1}, I_t)$.

2. Resampling. Resample $\{x_{t-1}^i, k_{t-1}^i\}$ to get $\{x_{t-1}'^i, k_{t-1}'^i\}$.

3. Prediction. For each $\{x_{t-1}'^i, k_{t-1}'^i\}$, generate a uniformly distributed number $\beta \in [0, 1]$.

(a) If $\beta < \alpha$, sample $(x_t^i, k_t^i)$ from $I(X_t \mid X_{t-1}, I_t)$ and set $\lambda_t^i$ as

$$\lambda_t^i = \frac{\sum_{j=1}^{N} p\big(X_t^i = (x_t^i, k_t^i) \mid X_{t-1}^j = (x_{t-1}^j, k_{t-1}^j)\big)}{I(X_t^i \mid X_{t-1}^i, I_t)}$$

(b) If $\beta \ge \alpha$, sample $(x_t^i, k_t^i)$ from $p(X_t \mid X_{t-1})$ and set $\lambda_t^i = 1$.

4. Measurement. Compute the likelihood $p(I_t \mid x_t^i, k_t^i)$. Update the weight of each particle as $\omega_t^i \propto \lambda_t^i\, p(I_t \mid x_t^i, k_t^i)$, then normalize the weights so that $\sum_i \omega_t^i = 1$.
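The four steps above can be sketched as one function. This is an illustrative toy, not the paper's implementation: all callables stand in for the paper's models (`prior_sample(p)` draws from $p(X_t \mid X_{t-1} = p)$, `prior_density(x, p)` evaluates that density, and `likelihood(x)` evaluates $p(I_t \mid x)$ via the cascade of Eq. (2)); detections are treated as a uniform proposal over the response list.

```python
import random

def particle_filter_step(particles, weights, detections, alpha,
                         prior_sample, prior_density, likelihood,
                         rng=random):
    N = len(particles)
    # Step 2: resample proportionally to the old weights.
    resampled = rng.choices(particles, weights=weights, k=N)
    new_particles, new_weights = [], []
    for p in resampled:
        if detections and rng.random() < alpha:
            # Step 3(a): take a detection response; correct its weight by
            # the prior transition mass, averaged over the resampled set,
            # divided by the (uniform) detection proposal density.
            x = rng.choice(detections)
            lam = sum(prior_density(x, q) for q in resampled) / N
            lam /= 1.0 / len(detections)
        else:
            # Step 3(b): sample from the prior transition; correction = 1.
            x = prior_sample(p)
            lam = 1.0
        # Step 4: measurement weight.
        new_particles.append(x)
        new_weights.append(lam * likelihood(x))
    total = sum(new_weights) or 1.0
    return new_particles, [w / total for w in new_weights]
```

The importance correction $\lambda$ is what keeps particles drawn from detections statistically consistent with particles drawn from the prior.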

4. EXPERIMENTS

4.1 Data collection

In our experiments, we use 9 pose models. Each pose model represents a rotation angle, ranging from -90 to +90 degrees with a step of 22.5 degrees. Figure 3 shows an example of the 9 poses in our experiment.

Fig.3. Our face models of the 9 poses, which range from -90 to +90 degrees with a step of 22.5 degrees.

The data is collected as follows. We put markers on the wall indicating the directions of the 9 poses. The person sits in the center of the room and is asked to look at each marker successively by moving his head. The whole process is recorded as a video. The person stops at each marker for a little while so that we know the pose in the video. Training data was collected by manually selecting images around the defined poses, so that our multiple pose-specific face models can represent faces with continuous poses. For example, the image set for pose 0° may contain images with poses from -10° to 10°. The negative sample set contains background images, images with only half a face, and the positive patterns of non-neighboring poses. For example, the positive samples of pose 0° are negative samples for the 6 non-neighboring poses [-90°, -67.5°, -45°, 45°, 67.5°, 90°]. Figure 4 shows the training samples of pose 0°. We collected training data of 10 individuals.

Fig.4. Training data of pose 0°. The left are positive samples; the right are negative samples. The sample size is 24*24.
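The non-neighboring assignment above is a simple index rule. A minimal sketch (the function name is illustrative; pose indices 0..8 map to -90°..+90° in 22.5° steps, so index 4 is the 0° pose):

```python
def non_neighboring_poses(k, M=9):
    # Pose indices whose positive samples serve as negatives for pose k:
    # every pose more than one step away. Immediate neighbors are excluded
    # because a face between two adjacent poses legitimately resembles both.
    return [j for j in range(M) if abs(j - k) > 1]
```

For the 0° pose (index 4) this yields the 6 poses at |angle| >= 45°, matching the example in the text.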

4.2 Face pose tracking

We applied our method to videos of people in the training set. Figure 5(a) shows some tracking results of a sequence with intensive pose and location changes; Figure 5(b) shows the estimated weighted poses. The track is lost in some frames with full occlusion but is reinitialized when the face reappears.

We also tested the method on people not in the training set. Figure 6 shows the results on a real meeting discussion video, where the person is talking with 3 other people. The person shakes his head to show disagreement with the others, which causes the vibration of pose numbers in Figure 6(b). Experiments show that our method is robust to expression changes, partial occlusions, and small head tilt and in-plane rotation.






Fig.5. Tracking results of faces with location changes and large pose variations.

Fig.6. Tracking results of faces in a group conversation with head shakes and head nods.





5. CONCLUSION AND FUTURE WORK

We proposed to address the problem of face pose tracking within a mixed-state particle filter. We combined the face detection results and the prior transition model to construct the proposal distribution, which helps generate good samples. Our method obtains continuous face pose tracking results and is robust to large pose changes, partial occlusions, and expressions.

In future work, the face poses will be used for recognizing the visual focus of attention and head gestures of participants in meeting video analysis.

ACKNOWLEDGEMENTS

This work is supported by the National Natural Science Foundation of China under grant No. 60673189.

REFERENCES

1. Ba, S.O., Odobez, J.M.: Head pose tracking and focus of attention recognition algorithms in meeting rooms. In: CLEAR (2006)
2. Doucet, A., de Freitas, J.F.G., Gordon, N.J. (eds.): Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York (2001)
3. Huang, C., Ai, H., Li, Y., Lao, S.: Vector boosting for rotation invariant multi-view face detection. In: ICCV (2005)
4. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 28(1):5-28 (1998)
5. Lee, K.C., Ho, J., Yang, M.H., Kriegman, D.: Video-based face recognition using probabilistic appearance manifolds. In: CVPR (2003)
6. Li, P., Wang, H.: Probabilistic face tracking using boosted multi-view detector. In: PCM (2004)
7. Li, Y., Ai, H., Huang, C., Lao, S.: Robust head tracking based on a multi-state particle filter. In: FG 2006, pp. 335-340, Southampton, UK, April 10-12 (2006)
8. Okuma, K., et al.: A boosted particle filter: multitarget detection and tracking. In: ECCV (2004)
9. Rui, Y., Chen, Y.: Better proposal distributions: object tracking using unscented particle filter. In: CVPR, pp. 786-793 (2001)
10. Voit, M., Nickel, K., Stiefelhagen, R.: Multi-view head pose estimation using neural networks. In: Computer and Robot Vision (2005)
11. Wang, P., Ji, Q.: Multi-view face tracking with factorial and switching HMM. In: Proc. 7th IEEE Workshops on Application of Computer Vision (2005)
12. Wang, Y., Liu, Y., Tao, L., Xu, G.: Real-time multi-view face detection and pose estimation in video stream. In: ICPR 2006, vol. 4, pp. 354-357, 20-24 Aug. (2006)
13. Wu, B., Ai, H., Huang, C., Lao, S.: Fast rotation invariant multi-view face detection based on real adaboost. In: Intl. Conf. on Automatic Face and Gesture Recognition (2004)