Audio-Visual Multimodal Fusion for Biometric Person Authentication and Liveness Verification


Girija Chetty and Michael Wagner
Human Computer Communication Laboratory
School of Information Sciences and Engineering
University of Canberra, Australia
girija.chetty@canberra.edu.au

Copyright © 2006, Australian Computer Society, Inc. This paper appeared at the NICTA-HCSNet Multimodal User Interaction Workshop (MMUI2005), Sydney, Australia. Conferences in Research and Practice in Information Technology, Vol. 57. Fang Chen and Julien Epps, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.


Abstract

In this paper we propose a multimodal fusion framework based on novel face-voice fusion techniques for biometric person authentication and liveness verification. Checking liveness guards the system against spoof/replay attacks by ensuring that the biometric data is captured from an authorised live person. The proposed framework, based on bimodal feature fusion, cross-modal fusion and 3D shape and texture fusion techniques, allows a significant improvement in system performance against impostor attacks, type-1 replay attacks (still photo and pre-recorded audio), and challenging type-2 replay attacks (CG animated video from a still photo and pre-recorded audio), as well as robustness to pose and illumination variations.

Keywords: multimodal fusion, biometric authentication, liveness verification.

1 Introduction

Due to increased security threats, biometric technology is evolving at an enormous pace, and many countries have started using biometrics for border control and national ID cards. Of late, biometric technology is not just limited to national security scenarios, but is also being used for a wide range of application domains such as forensics, criminal identification and prison security, and a number of other civilian applications such as preventing unauthorized access to ATMs, cellular phones, smart cards, desktop PCs, workstations, and computer networks (Ross, Prabhakar and Jain, 2003). In addition, there is a recent surge in the use of biometric technology for conducting transactions via telephone and Internet (electronic commerce and electronic banking), and in automobiles for key-less entry and key-less ignition.

Biometric authentication (Am I whom I claim I am?) involves confirming or denying a person's claimed identity based on his/her physiological or behavioural characteristics (Kittler, Matas, and Sanchez, 1997). This method of identity verification is preferred over traditional methods involving passwords and PINs for various reasons: (i) the person to be authenticated is required to be physically present at the point of verification; (ii) verification based on biometric techniques removes the need to remember a password or carry a token. Due to the increased use of computers and the Internet for information access, sensitive/personal data of an individual is easily available, and it is necessary to restrict unauthorized access to such private information. Moreover, remembering several PINs and passwords is difficult, and token-based methods of identity verification like passports and driver's licenses can be forged, stolen, or lost. Replacement of PINs and passwords with biometric techniques is hence a more efficient way of preventing unauthorized or fraudulent use of ATMs, cellular phones, smart cards, desktop PCs, workstations, and computer networks.

Various types of biometric traits can be used for person authentication, such as face, iris, fingerprints, DNA, retinal scan, speech, signatures and hand geometry. However, several human factors need to be taken into consideration for deployment in civilian, e-commerce and transaction control applications, unlike border-control and security applications, and it is necessary to make use of less intrusive biometric traits.

Face and voice biometric systems rate high in terms of user acceptance and deployment costs, due to their low intrusiveness and the ready availability of low-cost off-the-shelf system components (Poh and Korczak, 2001). A biometric system is essentially a pattern recognition system which verifies the identity of a person by determining the authenticity of a specific physiological or behavioural characteristic possessed by the user. An important issue in designing a practical biometric system is to determine how an individual can be reliably discriminated from another individual based on these characteristics in the presence of various environmental degradations, and whether these characteristics can be easily faked or spoofed.

Various studies (Ross, Prabhakar and Jain, 2003; Kittler, Matas, and Sanchez, 1997; Poh and Korczak, 2001) have indicated that no single modality can provide an adequate solution against impostor or spoof attacks. Single-mode systems in general are limited in performance due to unacceptable error rates, sensitivity to noisy biometric data, failure-to-enrol rates, and reduced flexibility to offer alternate biometric traits. In order to cope with the limitations of single-mode biometrics, researchers have proposed using multiple biometric traits concurrently for verification. Such systems are commonly known as multi-modal person authentication systems (Poh and Korczak, 2001). By fusing multiple biometric traits, systems gain more immunity to intruder attacks.

For an audio-visual person authentication system, for example, it will be more difficult for an impostor to impersonate another person using both audio and visual information simultaneously (Cheung, Mak and Kung, 2004). In addition, fusion of multiple cues, such as those from face and voice, can improve system reliability and robustness. For instance, while background noise has a detrimental effect on the performance of voice biometrics, it does not have any influence on face biometrics. On the other hand, while the performance of face recognition systems depends heavily on lighting conditions, lighting does not have any effect on voice quality.

However, current audiovisual multimodal biometric systems mostly verify a person's face statically, and hence these systems, though they may have an acceptable performance against impostor attacks, remain vulnerable to spoof and replay attacks, where a fake biometric is presented by the intruder to access the facility. To resist such attacks, person authentication should include verification of the "liveness" of the biometric data presented to the system (Chetty and Wagner, 2004a). Liveness verification in a biometric system means the capability to detect and verify whether or not the biometric sample presented is from a live person, during the training/enrolment and testing phases.
The system must be designed to protect against attacks with artificial/synthesized audio and/or video recordings, with checks that ensure that the presented biometric sample belongs to the live human being who was originally enrolled in the system, and not just any live human being.

Until now, although there has been much published research on liveness, for example of fingerprints, research on liveness verification in audio-visual person authentication systems has been very limited.

Liveness verification for face-voice person authentication systems should be possible due to the ready availability of multimodal synchronous face-voice data from speaking faces. We propose that novel feature extraction and fusion techniques that uncover the static and dynamic relationship between face and voice biometric information will allow liveness verification to be carried out in person authentication systems.

In this paper, some of the details of the proposed multimodal fusion framework for person authentication and liveness verification, based on novel techniques for fusion of face and voice features, are described. The techniques, based on static and dynamic bimodal feature fusion, cross-modal face-voice fusion and intra-modal shape and texture fusion using 3D face models, form the core multimodal fusion approaches of the framework, and allow a significant enhancement in system performance against impostor and replay attacks. The performance of the proposed feature extraction and multimodal fusion techniques in terms of equal error rates (EERs) and detection error trade-off (DET) curves was examined by conducting experiments with three different speaking-face data corpora, described in the next section. The details of some of the feature extraction and multi-modal fusion techniques developed are given in section 3.

The details of the impostor and replay attack experiments, with results, are given in section 4, followed by conclusions in section 5.

2 Speaking face data corpus

The speaking face data from three different data corpora, VidTIMIT, UCBN and AVOZES, was used for conducting impostor and replay attack experiments. The VidTIMIT multimodal person authentication database (Sanderson and Paliwal, 2003) consists of video and corresponding audio recordings of 43 people (19 female and 24 male). The mean duration of each sentence is around 4 seconds, or approximately 100 video frames. A broadcast-quality digital video camera in a noisy office environment was used to record the data. The video of each person is stored as a sequence of JPEG images with a resolution of 512 × 384 pixels, with corresponding audio provided as a 16-bit 32-kHz mono PCM file.

The second type of data used is the UCBN database, a free-to-air broadcast news database. Broadcast news is a continuous source of video sequences, which can be easily obtained or recorded, and has optimal illumination, colour, and sound recording conditions. However, some of the attributes of broadcast news databases, such as near-frontal images, smaller facial regions, multiple faces and complex backgrounds, require an efficient face detection and tracking scheme to be used. The database consists of 20-40 second video clips of anchor persons and news readers with frontal/near-frontal shots of 10 different faces (5 female and 5 male). Each video sample is a 25 frames per second MPEG2 encoded stream with a resolution of 720 × 576 pixels, with corresponding 16-bit, 48-kHz PCM audio.




Figure 1: Faces from (a) VidTIMIT, (b) UCBN, (c) AVOZES

The third database used is the AVOZES database, an audiovisual corpus developed for automatic speech recognition research (Goecke and Millar, 2004). The corpus consists of 20 native speakers of Australian English (10 female and 10 male speakers), and the audiovisual data was recorded with a stereo camera system to achieve more accurate 3D measurements of the face. The recordings were made at a 30 Hz video frame rate and a 16-bit 48-kHz mono audio rate in a controlled acoustic environment with no external noise, and some background computer and air-conditioning noise. For each speaker there were 3 spoken utterances, 10 digit sequences, 18 phoneme sequences (CVC words in a carrier phrase), and 22 VCV phoneme sequences (VCV words in a carrier phrase).

Figures 1a, 1b and 1c show sample data from the VidTIMIT, UCBN and AVOZES corpora. The three databases represent very different types of speaking-face data: VidTIMIT with original audio recorded in a noisy environment and a clean visual environment, UCBN with clean audio and visual environments but complex visual backgrounds, and AVOZES with stereo face data for better 3D face modeling.

3 Multimodal Fusion Framework

The proposed multimodal fusion framework is based on three core fusion approaches: bimodal feature fusion (BMF), cross-modal fusion (CMF), and 3D multi-modal fusion (3MF). Audio-visual fusion in these three approaches is performed at different levels with different features, to uncover the face-voice relationship needed for checking liveness and establishing the identity of the person. A brief description of the three fusion approaches used is given in the following sections.

3.1 Bimodal Feature Fusion (BMF)

The classical approaches to audio-visual multimodal fusion are based on late fusion and its variants, and have been investigated in great depth (Kittler, Matas, and Sanchez, 1997; Poh and Korczak, 2001). Late fusion, or fusion at the score level, involves combining the scores of different classifiers, each of which has made an independent decision. This means, however, that many of the correlation properties of the joint audio-video data are lost.

Fusion at the feature level (BMF), on the other hand, can substantially improve the performance of multimodal systems, as the feature sets provide a richer source of information than the matching scores, and because in this mode features are extracted from the raw data and subsequently combined. In addition, feature-level fusion allows the synchronization between closely coupled modalities of a speaking face, such as voice and lip movements, to be preserved throughout the various stages of authentication, facilitating liveness verification in systems that would otherwise be more vulnerable to replay attacks.

3.1.1 Acoustic features

The audio and visual features in the BMF approach were extracted from each frame of the speaking-face video clip, and the joint audio-visual feature vector was formed by direct concatenation of the acoustic and visual features from the lip region.

The acoustic features used were Mel-frequency cepstral coefficients (MFCCs) derived from the cepstrum. The pre-emphasized audio signal was processed using a 30 ms Hamming window with one-third overlap, yielding a frame rate of 50 Hz, to obtain the MFCC acoustic vectors. An acoustic feature vector was determined for each frame by warping 512 spectral bands into 30 Mel-spaced bands and computing the 8 MFCCs. Cepstral mean normalization was performed on all MFCCs before they were used for training, testing and evaluation. Before extracting MFCCs, the audio files from the two databases, VidTIMIT and UCBN, were mixed with acoustic noise at a signal-to-noise ratio of 6 dB. Channel effects with a telephone line filter were then added to the noisy PCM files to simulate channel mismatch.

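
The paper does not name a toolkit for this acoustic front end; the following is a minimal sketch of the processing just described (pre-emphasis, 30 ms Hamming window, 20 ms hop for a 50 Hz frame rate, 30 Mel bands, 8 MFCCs, cepstral mean normalisation), assuming librosa purely for illustration.

import numpy as np
import librosa

def extract_mfcc(wav_path, sr=32000, n_mfcc=8, n_mels=30):
    # Illustrative front end; parameter values follow the description above.
    y, sr = librosa.load(wav_path, sr=sr)           # 32 kHz mono PCM (VidTIMIT)
    y = librosa.effects.preemphasis(y)              # pre-emphasis
    win = int(0.030 * sr)                           # 30 ms Hamming window
    hop = int(0.020 * sr)                           # 20 ms hop -> 50 Hz frame rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=win, hop_length=hop,
                                n_mels=n_mels, window='hamming')
    mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)  # cepstral mean normalisation
    return mfcc.T                                   # (frames, 8)
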
3.1.2 Visual features


The visual features used were geometric and intensity features extracted from the lip region of all the faces in the video frame. Before the lip-region features can be extracted, faces need to be detected and recognised. The face detection for video was based on skin colour analysis in red-blue chrominance colour space, followed by deformable template matching with an average face, and finally verification with rules derived from the spatial/geometrical relationships of facial components. The lip region was determined using derivatives of hue and saturation functions, combined with geometric constraints. Figures 2(a) to 2(c) show some of the results of the face detection and lip feature extraction stages. The scheme is described in more detail in (Chetty and Wagner, 2004b). Similar to the audio files, the video data in both databases were mixed with artificial visual artefacts, namely added Gaussian blur and Gaussian noise, using a visual editing tool (Adobe Photoshop). The "Gaussian Blur" of Photoshop was set to 1.2, and the "Gaussian Noise" to 1.6.

To evaluate the power of the feature-level fusion (BMF) approach in preserving the audiovisual synchrony, and hence verifying liveness, experiments were conducted with both BMF and late fusion of audiovisual features. In the case of bimodal feature fusion, the audiovisual fusion involved a concatenation of the audio features (8 MFCCs) and visual features (eigen-lip projections (10) + lip dimensions (6)), and the combined feature vector was then fed to a GMM classifier. The audio features, acquired at 50 Hz, and the visual features, acquired at 25 Hz, were appropriately rate-interpolated to obtain synchronized joint audiovisual feature vectors.
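
As an illustration of this joint feature construction, the sketch below upsamples the 25 Hz visual stream to the 50 Hz audio frame rate by linear interpolation and concatenates the two streams per frame; the array names and the use of linear interpolation are assumptions, since the paper does not specify the interpolation method.

import numpy as np

def fuse_features(audio_feats, visual_feats):
    """audio_feats: (Na, 8) MFCCs at 50 Hz;
    visual_feats: (Nv, 16) at 25 Hz (10 eigen-lip projections + 6 lip dimensions)."""
    n = audio_feats.shape[0]
    t_audio = np.arange(n) / 50.0
    t_visual = np.arange(visual_feats.shape[0]) / 25.0
    # interpolate each visual dimension onto the audio time axis
    visual_up = np.stack([np.interp(t_audio, t_visual, visual_feats[:, d])
                          for d in range(visual_feats.shape[1])], axis=1)
    return np.hstack([audio_feats, visual_up])      # (Na, 24) joint vectors
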

For late fusion, the audio and visual features were fed to independent GMM classifiers, and the weighted scores (weight α) (Sanderson and Paliwal, 2004) from each stage were fed to a weighted-sum fusion unit.
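
The weighted-sum score fusion reduces to a single line; the convention below (α weighting the visual score and 1 − α the audio score) is an assumption, chosen to be consistent with the weight sweep reported in section 4.

def late_fusion_score(audio_score, visual_score, alpha=0.75):
    # alpha = 0 uses only the audio score, alpha = 1 only the visual score
    return (1.0 - alpha) * audio_score + alpha * visual_score
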

Figure 3 shows the various sections of the bimodal feature fusion (BMF) module.

Figure 3: Bi-modal Feature Fusion module

3.2 Cross-Modal Fusion


For the cross-modal fusion approach, the proposed features detect the liveness of biometric information by extracting the face-voice synchrony information in a cross-modal space. The cross-modal features proposed are based on latent semantic analysis (LSA), involving singular value decomposition of the joint face-voice feature space, and canonical correlation analysis (CCA), based on optimising cross-correlations in a rotated audio-visual subspace.

Latent semantic analysis is a powerful tool used in text information retrieval to discover underlying semantic relationships between different textual units (Deerwester et al., 2001). The LSA technique achieves three goals: dimension reduction, noise removal, and the uncovering of the semantic and hidden relations between different objects such as keywords and documents. In our current context, we used LSA to uncover the synchronism between image and audio features in a video sequence. The method consists of four steps: construction of a joint multimodal feature space, normalisation, singular value decomposition, and semantic association measurement.
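
A minimal sketch of those four LSA steps on the joint audio-visual matrix is given below (rows are frames, columns the concatenated eigenface and MFCC coefficients); the z-score normalisation, the rank k and the cosine association measure are illustrative assumptions, not taken from the paper.

import numpy as np

def lsa_features(visual, audio, k=8):
    joint = np.hstack([visual, audio])                        # 1. joint feature space
    joint = (joint - joint.mean(0)) / (joint.std(0) + 1e-8)   # 2. normalisation
    U, S, Vt = np.linalg.svd(joint, full_matrices=False)      # 3. SVD
    return U[:, :k] * S[:k]                                   # 4. rank-k semantic space

def semantic_association(a, b):
    # cosine similarity between two frames in the reduced space
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
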

Canonical correlation analysis, an equally powerful multivariate statistical technique, attempts to find a linear mapping that maximizes the cross-correlation between two feature sets (Borga and Knutsson, 1998). It finds the transformation that can best represent (or identify) the coupled patterns between features of two different subsets. A set of linear basis functions, having a direct relation to maximum mutual information, is obtained in each signal space, such that the correlation matrix between the signals described in the new basis is diagonal. The basis vectors can be ordered such that the first pair of vectors w_x1 and w_y1 maximize the correlation between the projections (x^T w_x1, y^T w_y1) of the signals x and y onto the two vectors respectively. A subset of vectors containing the first k pairs defines a linear rank-k relation between the sets that is optimal in a correlation sense.

In other words, it gives the linear combination of one set of variables that is the best predictor and, at the same time, the linear combination of the other set which is most predictable. It has been shown that finding the canonical correlations is equivalent to maximizing the mutual information between the sets if the underlying distributions are elliptically symmetric (Borga and Knutsson, 1998).

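
The sketch below illustrates how CCA features of this kind can be obtained; scikit-learn's CCA is assumed purely for illustration, and the choice of k = 8 components mirrors the 8-dimensional CCA features used in the experiments.

import numpy as np
from sklearn.cross_decomposition import CCA

def cca_features(visual, audio, k=8):
    """visual: (frames, 20) eigenface coefficients; audio: (frames, 12) MFCCs."""
    cca = CCA(n_components=k)
    v_proj, a_proj = cca.fit_transform(visual, audio)   # projections onto canonical pairs
    # per-dimension correlations of the coupled patterns
    corrs = [np.corrcoef(v_proj[:, i], a_proj[:, i])[0, 1] for i in range(k)]
    return v_proj, a_proj, corrs
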
Figure 4 shows the processing stages for cross-modal feature extraction. The cross-modal feature extractor computes LSA and CCA feature vectors from low-level visual and audio features. The visual features are 20 PCA (eigenface) coefficients, and the audio features are 12 MFCC coefficients. Based on preliminary experiments, fewer than 10 LSA and CCA features are normally found to be sufficient to achieve good performance. This is a significant reduction of feature dimension compared with the 32-dimensional audio-visual feature vector formed by concatenated bimodal feature fusion of the 20 PCA and 12 MFCC vectors.



Figure 4: Cross-modal Fusion module

3.3 3D Multi-Modal Fusion


For this approach, shape and texture features from 3D face models are extracted and fused with acoustic features. Before three-dimensional features can be extracted, a 3D face model needs to be developed using an appropriate modeling technique based on the facial information available from the data corpus. The VidTIMIT database, for example, consists of frontal and profile view images of the faces, and the AVOZES data comprises left and right images of the faces.

The 3D face modeling was based on the approach proposed by Gordon (1995), and Hsu and Jain (2001). The 3D face modeling algorithm starts by computing 3D coordinates of automatically extracted facial feature points. Correspondence between feature points in both images is established using epipolar constraints, and then depth information from the front and profile views for VidTIMIT faces, and the left and right views for AVOZES faces, is computed using perspective projection. The 3D coordinates of the selected feature points are then used to deform a 3D generic face model to obtain a person-specific 3D face model. Figure 5 shows a sample frontal and profile face from the VidTIMIT database and the 3D face developed.
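
The depth-recovery step can be illustrated with standard two-view triangulation; the sketch below assumes OpenCV and calibrated 3x4 projection matrices, and takes the matched 2D feature points as given (the paper obtains them via epipolar constraints). It covers only this one step of the modeling pipeline, not the generic-model deformation.

import numpy as np
import cv2

def triangulate_feature_points(P1, P2, pts_view1, pts_view2):
    """P1, P2: 3x4 projection matrices of the two views;
    pts_view1, pts_view2: (N, 2) matched facial feature points."""
    X_h = cv2.triangulatePoints(P1, P2,
                                pts_view1.T.astype(float),
                                pts_view2.T.astype(float))    # 4xN homogeneous
    X = (X_h[:3] / X_h[3]).T                                  # (N, 3) 3D points
    return X   # used to deform the generic 3D face model to the person
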


The techniques proposed to date for processing and integration of shape and texture features are in the 3D face recognition domain, and have evolved based on the assumption that there is no correlation between the shape and texture features of a 3D face. This might be true for static 3D faces, and most of the research efforts so far have mainly addressed recognition of still 3D faces (Hsu and Jain, 2001). But a speaking face is a kinematic-acoustic system in motion, and the shape, texture and acoustic features during speech production must be correlated in some way or other.

A number of studies carried out by Yehia, Rubin and Vatikiotis-Bateson (1998), and Yehia, Kuratate and Vatikiotis-Bateson (2002), have demonstrated this correlation based on the anatomical fact that a single neuromotor source controlling the vocal tract behavior is responsible for both the acoustic and the visible attributes of speech production. Hence, for a speaking face, not only are the facial motion and speech acoustics correlated, but the head motion and the fundamental frequency (F0) produced during speech are also related. Though there is no clear and distinct neuromotor coupling between head motion and speech acoustics, there is an indirect anatomical coupling created by the complex of strap muscles running between the floor of the mouth, through the thyroid bone, and attaching to the outer edge of the cricothyroid cartilage.

Due to this indirect coupling, a speaker tends to raise the pitch when the head goes up while talking. The head motion can be modeled by tracking 3D face shapes together with complementary and synchronous 2D facial feature variation and 1D acoustic variation. This unique and rich information is normally person-specific and cannot easily be spoofed, either by a real impostor or by CG-animated speaking faces. Hence a multimodal fusion of 3D shape, texture and acoustic features can enhance the performance of face-voice authentication systems and check the liveness of the biometric data presented to the system.

Figure 5: 3D face model for VidTIMIT face

The major deformations for the speaking face are in the lower part of the face compared to the rest of the face. Hence the lower half of the face was used for 3D multimodal fusion. The lower part of the face was modeled using 128 vertices and 200 surfaces. This means a fusion of the acoustic vector with a 128-dimensional shape (X, Y, Z) vector and a texture feature vector of similar size. This is too large a dimension for a reasonable performance to be achieved. However, after principal component analysis (PCA) of the shape vector and the texture vector separately, we learnt that about 6-8 principal components of the shape vector and 3-4 components of the texture vector explain more than 95% of the variation in lip shapes and appearances during spoken utterances of most English language sentences.
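
A sketch of that dimensionality reduction, assuming scikit-learn and flattened per-frame shape and texture vectors, is given below; keeping the components that explain about 95% of the variance reproduces the 6-8 shape and 3-4 texture components reported above.

from sklearn.decomposition import PCA

def reduce_shape_texture(shape_vecs, texture_vecs, var=0.95):
    """shape_vecs: (frames, D_shape) flattened (X, Y, Z) lip-region vertices;
    texture_vecs: (frames, D_texture) flattened texture values."""
    pca_shape = PCA(n_components=var).fit(shape_vecs)       # ~6-8 components retained
    pca_texture = PCA(n_components=var).fit(texture_vecs)   # ~3-4 components retained
    return pca_shape.transform(shape_vecs), pca_texture.transform(texture_vecs)
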


Figure 6: Principal visemes during English speaking

The 8 eigenvalues for the shape vector correspond to jaw opening/closing, lip protrusion/retraction, lip opening/closing, and jaw protrusion/retraction, as shown in Figure 6. Similarly, the 3-4 eigenvalues of the texture vector describe most of the appearance variations, mainly those corresponding to one rounded viseme with closed lips (e.g. ['u']), one rounded viseme with open lips, and one spread viseme with spread lips (e.g. ['i']).

The 18-dimensional audio-visual feature vector for the 3D multimodal fusion module was constructed by concatenating 8 MFCCs + 1 F0 feature, 6 eigen-shape and 3 eigen-texture features. The fundamental frequency F0 was computed by the autocorrelation method.
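
The autocorrelation F0 estimate can be sketched as below; the voicing check and the 60-400 Hz search range are illustrative assumptions, not values from the paper.

import numpy as np

def f0_autocorrelation(frame, sr=32000, fmin=60.0, fmax=400.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)          # lag range for 60-400 Hz
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag if ac[lag] > 0 else 0.0          # 0.0 for (crudely) unvoiced frames
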

4 Liveness Experiments

To investigate the potential of the proposed fusion approaches, that is, bimodal feature fusion, cross-modal fusion and 3D multimodal fusion, different sets of experiments were conducted.

In the training phase, a 10-Gaussian mixture model of each client's feature vectors in the three-dimensional space was built by constructing a gender-specific universal background model (UBM) and then adapting each UBM by MAP adaptation (Reynolds and Dunn, 2000). In the test phase, clients' live test recordings were evaluated against the client's model λ by determining the log-likelihoods log p(X|λ) of the time sequences X of audiovisual feature vectors. A Z-norm based approach (Auckenthaler and Carey, 1999) was used for score normalization.
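
The back end just described (gender-specific UBM, MAP adaptation of each client model, log-likelihood scoring and Z-norm) can be sketched as follows; the mean-only adaptation with a relevance factor of 16 and the use of scikit-learn's GaussianMixture are illustrative assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_feats, n_mix=10):
    return GaussianMixture(n_components=n_mix, covariance_type='diag').fit(background_feats)

def map_adapt_means(ubm, client_feats, r=16.0):
    # Copy the UBM and adapt only the component means towards the client data.
    client = GaussianMixture(n_components=ubm.n_components, covariance_type='diag')
    client.weights_, client.covariances_ = ubm.weights_, ubm.covariances_
    client.precisions_cholesky_ = ubm.precisions_cholesky_
    resp = ubm.predict_proba(client_feats)               # (T, M) component posteriors
    n_k = resp.sum(0)                                     # soft counts
    f_k = resp.T @ client_feats                           # first-order statistics
    alpha = (n_k / (n_k + r))[:, None]
    client.means_ = alpha * (f_k / np.maximum(n_k[:, None], 1e-8)) \
                    + (1.0 - alpha) * ubm.means_
    return client

def znorm_score(client_gmm, X, impostor_scores):
    raw = client_gmm.score(X)                             # mean log p(x_t | lambda)
    return (raw - np.mean(impostor_scores)) / (np.std(impostor_scores) + 1e-8)
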

For testing replay attacks, two types of replay attack experiments were conducted. For type-1 replay attacks, a number of fake recordings were constructed by combining the sequence of audio feature vectors from each test utterance with ONE visual feature vector chosen from the sequence of visual feature vectors. Such a fake sequence represents an attack on the authentication system which is carried out by replaying an audio recording of the client's utterance while presenting a still photograph to the camera.
Four such fake audiovisual sequences were constructed from different still frames of each client test recording. Log-likelihoods log p(X'|λ) were computed for the fake sequences X' of audiovisual feature vectors against the client model λ.
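
A type-1 fake trial of this kind is straightforward to construct in feature space, as the sketch below shows; the array names are illustrative.

import numpy as np

def make_type1_fake(audio_feats, visual_feats, still_index=0):
    # Pair the genuine audio sequence with one repeated visual vector
    # (a "still photo" in feature space); scored as log p(X' | client model).
    still = np.tile(visual_feats[still_index], (audio_feats.shape[0], 1))
    return np.hstack([audio_feats, still])
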
For type-2 replay attacks, a synthetic video clip was constructed from a still photo of each speaker. This represents the scenario of a replay attack with an impostor presenting a fake video clip constructed from pre-recorded audio and a still photo of the client, animated with facial movements and voice-synchronous lip movements. The still photo of each client was voice-synched with the speech signal of each speaker, using a set of commercial software tools (Adobe Photoshop Elements, Discreet 3DS Max, and Adobe After Effects).

We constructed several fake video clips by extracting ONE face (the first face) from the video sequence, which acts as a key frame, animating the lip region of the key frame by phoneme-to-viseme mapping, then adding random deformations and movements to the face, and finally rendering the lip and face movements with speech, all together as a new video clip. The synthesized fake clip visually emulates a normal talking head with certain facial and head movements in three-dimensional space, in synchronism with the spoken utterance.

Performance in terms of DET curves and EERs was examined for text-dependent and text-independent experiments. For all experiments, the threshold was set using data from the test data set. The results obtained for each of the fusion approaches are described next.

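
For reference, the sketch below shows one common way to read an EER off sets of genuine and impostor scores (a DET curve plots the same miss and false-alarm rates on normal-deviate axes); the thresholding convention is an assumption, not taken from the paper.

import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i]), thresholds[i]
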
4.1.1 Results for BMF approach


For the bimodal feature fusion approach, only type-1 replay attacks were studied, and training with the VidTIMIT and UCBN corpora was done with pose and illumination normalization of faces. For VidTIMIT, 24 male and 19 female clients were used to create separate gender-specific universal background models. The first two utterances, common to all speakers in the corpus, were used for text-dependent experiments, and 6 different utterances for each speaker allowed text-independent verification experiments to be conducted. For text-independent experiments, four utterances from session 1 were used for training and four utterances from sessions 2 and 3 were used for testing.
g.

For the UCBN database, the training data for both text-dependent and text-independent experiments contained 15 utterances from 5 male and 5 female speakers, with 5 utterances for testing, each recorded in a different session. The utterances were of 20-second duration for text-dependent experiments and of 40-second duration in text-independent mode. Similarly to VidTIMIT, separate UBMs for the male and female cohorts were created for the UCBN data.

Table 1 shows the number of client trials and replay attack trials conducted for examining the performance of the bimodal feature fusion module. The first row in Table 1 refers to experiments with the VidTIMIT database in text-dependent mode for a male-only cohort, comprising a total of 48 client trials (24 clients × 2 utterances per client) and 192 replay attack trials (24 clients × 2 utterances × 4 fake sequences per client).

As a baseline performance measure, both late fusion and feature fusion experiments were conducted with the concatenated audio-visual feature vector described in section 3.1. The results for the DB1TIMO (VidTIMIT database, text-independent, male-only cohort) and DB2TDFO (UCBN database, text-dependent, female-only cohort) experiments are reported here.

All late fusion experiments used varying combination weights α for combining the audio and visual scores. α was varied from 0 to 1, with larger α giving greater weight to the visual scores. As shown in Table 2, the baseline EER achieved is 3.65% for DB1TIMO and 2.55% for DB2TDFO for feature fusion, as compared to 8.1% (DB1TIMO) and 6.8% (DB2TDFO) achieved for late fusion with α = 0.75.

In Figure 3, the behaviour of the system is shown when subjected to different types of environmental degradations, as is the EER sensitivity to variations in training data size.

Once again, feature-level fusion outperforms late fusion under acoustic and visual degradations. When mixed with acoustic noise (factory noise at 6 dB SNR + channel effects), feature fusion allows a performance improvement of the order of 38% compared to late fusion (α = 0.25), and 18% compared to late fusion (α = 0.75). When mixed with visual artefacts, the improvement in performance achieved with feature fusion is about 30.40% as compared to LF (α = 0.25), and 18.9% as compared to LF (α = 0.75). Table 3 shows the baseline EERs and the EERs achieved with the inclusion of visual artefacts, acoustic noise and shorter training data. The table also shows the drop in performance for both late fusion and feature fusion.


Table 1: Number of client and replay attack trials


Table 2: Relative performance of BMF with acoustic noise, visual artifacts and variation in training data size.

The influence of training utterance length variation on system performance is quite remarkable and different compared to the other effects.

The system is more sensitive to utterance length variation in feature fusion mode than in late fusion mode (Table 2). The drop in performance is smaller for late fusion (9.46% for α = 0.75 and 26.57% for α = 0.25) as compared to a 42.32% drop for feature fusion for DB1TIMO; likewise, the drops are 12.15% and 24.53% as compared to a 40.96% drop for DB2TDFO. The utterance length was varied from 4 seconds to 1 second for DB1TIMO and from 20 seconds to 5 seconds for DB2TDFO data. This drop in performance is because of the larger dimensionality of the joint audiovisual feature vectors used (8 MFCCs + 10 eigenlips + 6 lip dimensions), as well as the shorter utterance length, which seems not to be sufficient to establish the audiovisual synchrony.

4.1.2 Results for CMF approach


For the cross-modal fusion approach, both type-1 and type-2 replay attacks were studied, and training with the VidTIMIT and UCBN corpora was done with pose and illumination normalization of faces. The performance of the LSA and CCA features for the CMF module was compared with concatenated BMF features (20 PCA + 12 MFCC) as a baseline.
ison.

The EER results in Table 3 show the potential of the LSA and CCA features of the CMF module over the BMF module for type-1 replay attacks. An improvement of 80% with 8-dimensional LSA features and 60% with 8-dimensional CCA features is achieved over the concatenated 32-dimensional BMF fusion approach.



Table 3: EERs for type-1 replay attacks

Table 5: EERs for type-2 replay attacks


Table 4 shows the improvement in error rates achieved for type-2 replay attacks. Approximately 43% improvement in EERs is achieved with 8-dimensional LSA features and 22% with 8-dimensional CCA features. This is a remarkable improvement in EERs, due to the ability of the LSA and CCA features to detect the mismatch in synchrony in video replay attacks.

4.1.3 Results for 3MF approach


For the 3D multimodal fusion approach, impostor, type-1 and type-2 replay attacks were studied, and training with the VidTIMIT and AVOZES corpora was done. The results for only two types of data, that is, DB1TIMO (VidTIMIT database, text-independent, male-only cohort) and DB2TDFO (AVOZES database, text-dependent, female-only cohort), are reported here.

For both types of data, both late fusion and feature-level fusion of the shape and texture features were examined. For late fusion, equal weights for the shape and texture scores were used. Table 6 shows the number of client, impostor and replay attack trials for the 3MF module.

The DET curve and EER results in Table 6 and Figure 7 show the potential of the proposed fusion of 3D eigen-shape and eigen-texture features with acoustic features (MFCC + F0) in the 3MF approach to thwart impostor and replay attacks for VidTIMIT and AVOZES data without pose and illumination normalization. For the VidTIMIT corpus, less than 1% EER is achieved, with 0.92% for late fusion and 0.64% for feature fusion.


Table 6: Number of client, impostor and replay attack trials for the 3MF module





Figure 7: DET curves for type-1 TD tests, (a) male subjects in VidTIMIT, (b) female subjects in UCBN

Table 7: EERs for type-2 replay attacks


Feature fusion performs better, with a 30% improvement as compared to late fusion, due to the synchronous processing of eigen-shape, eigen-texture and acoustic features. For the AVOZES corpus, the EER achieved is 1.24% with feature fusion as compared to 1.53%, about a 20% EER improvement. For type-1 replay attacks, less than 1% EER is achieved for VidTIMIT and AVOZES, with feature fusion performing better than late fusion (48% improvement for VidTIMIT data vs. 38% for AVOZES data). Less than 7% EER is achieved for type-2 replay attacks for both VidTIMIT and AVOZES data, with the best EER equal to 1.9% for VidTIMIT TIMO data and the worst EER of 6.45% for AVOZES TDFO data. The fusion of acoustic features with three-dimensional shape and texture features allowed a significantly better performance and robustness to pose and illumination variations, though type-2 replay attacks are more complex replay attacks to detect.
.

5 Conclusions


In this paper we show the potential of a multi-modal fusion framework with several new features and different fusion techniques for biometric person authentication and liveness verification. For the BMF module, feature-level fusion of audiovisual feature vectors substantially improves the performance of a face-voice authentication system for checking liveness and thwarting replay attacks. Also, the sensitivity of the BMF module to variations in the size of the training data has been recognized. For the CMF module, the two new cross-modal features, LSA and CCA, have the power to thwart type-2 replay attacks. About 42% overall improvement in error rate with CCA features and 61% improvement with LSA features is achieved as compared to feature-level fusion of image-PCA and MFCC face-voice feature vectors. For the 3MF module, features based on three-dimensional face modeling perform better against impostor and replay attacks. The multimodal feature fusion of acoustic, 3D shape and texture features allowed an improvement of 25-40% over the CMF features, with less than 1% EER for type-1 replay attacks and less than 7% EER for the more difficult type-2 replay attacks, a significantly better performance.


6 References

Auckenthaler, R., Paris, E. and Carey, M., "Improving GMM Speaker Verification System by Phonetic Weighting", Proceedings ICASSP'99, pp. 1440-1444, 1999.

Borga, M. and Knutsson, H., "An adaptive stereo algorithm based on canonical correlation analysis", Proceedings of the Second IEEE International Conference on Intelligent Processing Systems, pp. 177-182, August 1998.

Cheung, M.C., Yiu, K.K., Mak, M.W. and Kung, S.Y., "Multi-sample fusion with constrained feature transformation for robust speaker verification", Proceedings Odyssey'04 Conference.

Chetty, G. and Wagner, M., "'Liveness' Verification in Audio-Video Authentication", Proc. Int. Conf. on Spoken Language Processing ICSLP-04, Jeju, Korea, pp. 2509-2512.

Chetty, G. and Wagner, M., "Automated lip feature extraction for liveness verification in audio-video authentication", Proc. Image and Vision Computing 2004, New Zealand, pp. 17-22.

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, 2001, 41(6), 391-407.

Goecke, R. and Millar, J.B., "The Audio-Video Australian English Speech Data Corpus AVOZES", Proceedings of the 8th International Conference on Spoken Language Processing INTERSPEECH 2004 - ICSLP, Volume III, pages 2525-2528, 4-8 October 2004.

Gordon, G., "Face Recognition from Frontal and Profile Views", Proceedings Int'l Workshop on Face and Gesture Recognition, Zurich, 1995, pp. 47-52.

Hsu, R.L. and Jain, A.K., "Face Modeling for Recognition", Proceedings Int'l Conf. on Image Processing, ICIP, Greece, Oct. 7-10, 2001.

Kittler, J., Matas, G., Jonsson, K. and Sanchez, M., "Combining evidence in personal identity verification systems", Pattern Recognition Letters, vol. 18, no. 9, pp. 845-852, Sept. 1997.

Poh, N. and Korczak, J., "Hybrid biometric person authentication using face and voice features", Proc. of Int. Conf. on Audio- and Video-Based Biometric Person Authentication, Halmstad, Sweden, June 2001, pp. 348-353.

Reynolds, D., Quatieri, T. and Dunn, R., "Speaker Verification Using Adapted Gaussian Mixture Models", Digital Signal Processing, Vol. 10, No. 1-3, 2000, pp. 19-41.

Ross, A. and Jain, A.K., "Information fusion in biometrics", Pattern Recognition Letters 24, 13 (Sept. 2003), 2115-2125.

Sanderson, C. and Paliwal, K.K. (2003), "Fast features for face authentication under illumination direction changes", Pattern Recognition Letters 24, 2409-2419.

Yehia, H., Rubin, P. and Vatikiotis-Bateson, E. (1998), "Quantitative association of vocal tract and facial behavior", Journal of Speech Communication 26(1-2), 23-43.

Yehia, H., Kuratate, T. and Vatikiotis-Bateson, E., "Linking Facial Animation, Head Motion and Speech Acoustics", Journal of Phonetics, Vol. 30, Issue 3, 2002.