Proceedings of the Institute for Electronics, Information and Communication Engineers (IEICE) and The Acoustical Society of Japan, Speech Dynamics by Ear, Eye, Mouth and Machine: An Interdisciplinary Workshop, Kyoto, Japan, 27 June, 2003.

Auditory Supplements to Speechreading

Ken W. Grant

Army Audiology and Speech Center, Walter Reed Army Medical Center, Washington DC 20307-5001

E-mail: grant@tidalwave.net


Auditory-visual speech recognition is far more accurate and robust than speech recognition by hearing alone. Yet, in spite of the benefits and obvious importance of auditory-visual speech for everyday communication, little is known about the mechanisms involved in auditory-visual speech integration. As a preliminary step toward the development of a generalized model of speech communication that incorporates visual speech cues, it is necessary to delineate the spectral and temporal interactions that occur when visual speech cues are used in tandem with acoustic cues. It will be shown that this interaction is both highly synergistic and non-linear. Further, it is suggested that visual speech cues may serve as a guide for auditory speech processing by informing the listener of spectral and temporal landmarks that can be used to decode the speech message.


Keywords: Speechreading, Auditory-Visual Speech Perception, Auditory-Visual Integration, Intelligibility Model


1. Introduction

1.1. Acoustic cues that supplement speechreading

The importance of visual cues for understanding spoken language has been known for some time [1]. Much of the early work focused on the specific needs of profoundly hearing-impaired patients who rely on speechreading as the primary means for decoding spoken language. When speechreading is used as the sole channel for receiving speech, a number of important segmental and suprasegmental speech features are lost (e.g., voicing, nasality, and intonation), thus restricting the rate and accuracy of communication to roughly 40% of that of a normal-hearing individual [2]. On the other hand, when speechreading is combined with information from other sensory channels (auditory or tactile), lost information can often be recovered, especially if the information supplied by the other sensory channels complements the information provided visually. An example of such a combination is shown in Table 1. The eleven visual categories or "visemes" shown in the top row of Table 1 were obtained from an analysis of error patterns made by trained normal-hearing speechreaders [3]. Consonants belonging to different categories were seldom confused (e.g., /b/ vs /t/), while consonants belonging to the same category were frequently confused (e.g., /b/ vs /p/). The subsequent three rows in the Table show what would be expected to happen if information about voicing, nasality, and affrication were provided from some other sensory channel and combined with speechreading. As can be seen, the additional information completely resolves all remaining ambiguities, thus leading, in theory, to perfect recognition.
Table 1. Linguistic feature contributions to visual speech recognition. The top row represents typical feature classifications for speechreading alone (visemes). Each subsequent row represents the effects of adding information about another linguistic feature via an additional input channel (in this case auditory). Note that as additional features are added, consonant confusions associated with speechreading are resolved to a greater and greater extent. Adapted from [3].

[Table 1 rows: Speechreading; + Voicing; + Nasality; + Affrication.]
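The logic of Table 1 can be expressed as successive partition refinement: each added auditory feature splits every visual class into subclasses that agree on that feature, and recognition becomes perfect (in theory) once every class contains a single consonant. The short Python sketch below illustrates the idea with a toy viseme inventory and a voicing feature; the class memberships and feature values are illustrative placeholders, not the actual eleven categories of [3].

```python
# Toy viseme classes (visually confusable consonant groups) and a
# voicing feature; illustrative only, not the categories of [3].
visemes = [{"p", "b", "m"}, {"f", "v"}, {"t", "d", "s", "z", "n"}]
voicing = {"p": 0, "b": 1, "m": 1, "f": 0, "v": 1,
           "t": 0, "d": 1, "s": 0, "z": 1, "n": 1}

def refine(classes, feature):
    """Split each class by an added feature, as in the successive
    rows of Table 1; smaller classes mean fewer possible confusions."""
    refined = []
    for cls in classes:
        for value in sorted({feature[c] for c in cls}):
            refined.append({c for c in cls if feature[c] == value})
    return refined

# e.g., refine(visemes, voicing) yields {p}, {b, m}, {f}, {v},
# {t, s}, {d, z, n}; refining again with nasality and affrication
# features mirrors the remaining rows of the table.
```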



1.2. The search for minimal acoustic supplements

In order to transmit information that complements speechreading to profoundly deaf individuals, researchers had to come to grips with a perplexing problem, namely, that the information-handling capacity of residual auditory function in deaf patients, or of the tactile system, is greatly reduced compared to normal hearing [4]. As a result, natural speech signals could not be directly transmitted to these channels and some form of coding was required. In other words, it became necessary to find acoustic and/or tactile supplements to speechreading that were 1) capable of conveying information about voicing, nasality, affrication, and other speech features that were not readily transmitted via speechreading, and 2) simple enough to be processed effectively by the receiving modality. This approach resulted in a number of demonstrations which showed that certain acoustic signals, which by themselves were mostly unintelligible, could nevertheless lead to very high intelligibility scores when combined with speechreading [2, 5, 6]. For example, Grant et al. [2] measured the contribution of auditory sinewave analogs representing various speech features, such as amplitude-envelope and fundamental-frequency information, to speechreading. The acoustic signals were pure tones modulated in frequency (FM), amplitude (AM), or both (AMFM) based on an analysis of the amplitude and frequency of the voice fundamental. Speech understanding was evaluated using the connected discourse tracking procedure [7], which involves a speaker reading aloud from text and the receiver repeating verbatim what the speaker has said. The sessions are timed and the results expressed as the number of correctly reproduced words per minute (WPM), or as a percent of the normal-hearing tracking rate (roughly 110 WPM). Results are displayed in Figure 1, and show clearly that the reception of connected speech is improved dramatically with acoustic signals that by themselves have almost zero intelligibility. For example, the tracking rate increased from roughly 37% for speechreading alone (SA) to almost 68% for either AM or FM tones. A further increase to nearly 80% of the normal tracking rate was observed when amplitude-envelope and fundamental-frequency information were combined, as in the AMFM condition or a lowpass-filtered speech condition (LPF) with a cutoff frequency of 300 Hz.
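As a rough illustration of how such supplements can be generated, the sketch below synthesizes an AM, FM, or AMFM tone from per-sample F0 and amplitude-envelope tracks. It is a minimal sketch of the general idea only: the 16-kHz rate, the 200-Hz carrier, and the gating of unvoiced stretches are my assumptions, not the exact procedure of [2].

```python
import numpy as np

def synthesize_supplement(f0, env, fs=16000, carrier_hz=200.0, mode="AMFM"):
    """Build a tonal speechreading supplement from per-sample F0 (Hz)
    and amplitude-envelope tracks; f0 == 0 marks unvoiced stretches,
    which gate the tone off."""
    voiced = f0 > 0
    # FM and AMFM: instantaneous frequency follows F0;
    # AM: a fixed carrier frequency is used instead.
    target = f0 if mode in ("FM", "AMFM") else carrier_hz
    inst_freq = np.where(voiced, target, 0.0)
    phase = 2.0 * np.pi * np.cumsum(inst_freq) / fs
    # AM and AMFM: amplitude follows the speech envelope;
    # FM: constant amplitude while voiced.
    amp = env if mode in ("AM", "AMFM") else voiced.astype(float)
    return amp * np.sin(phase)
```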

Figure 1. Connected discourse tracking rates for speechreading alone (SA) and for speechreading plus amplitude-modulated tone (AM), frequency-modulated tone (FM), amplitude- and frequency-modulated tone (AMFM), and lowpass-filtered speech (LPF). Reprinted from [2].

[Figure 1 axes: conditions (SA, AM, FM, AMFM, LPF) vs. Tracking Rate (WPM; % re: Normal).]

[Figure 3 axes: Filter Condition (1-12) vs. Percent Correct, for A and AV. Passbands (Hz): 1: 250-505; 2: 645-955; 3: 1130-1515; 4: 1720-2140; 5: 2600-3255; 6: 4200-5720; 7: 250-795; 8: 1130-1930; 9: 3255-5720; 10: 250-1130; 11: 1130-2355; 12: 2600-5720.]
Figure 3. Auditory and auditory-visual speech intelligibility as a function of filter band. Note that even though some bands have greater auditory intelligibility than others (e.g., band 6 versus band 1), their auditory-visual intelligibility is not as great. Adapted from [15].



1.3. A challenge for models of speech intelligibility

These data, along with those from similar studies that use non-traditional acoustic signals as supplements to speechreading [5, 6, 8, 9], pose a serious challenge for models of speech intelligibility which base their predictions solely on physical attributes of the signal, speaker, listener, and the listening environment [10, 11, 12]. Predictors of speech intelligibility such as the Articulation Index (AI) or the Speech Transmission Index (STI) either ignore the role of visual speech cues altogether (STI), or treat the visual channel as an independent source of speech information that simply adds to the auditory information (AI). In the case of the AI, this relatively simplistic view is most likely incorrect in that it does not allow for the possibility of auditory-visual interactions. For example, in the 1969 ANSI standard for calculating the Articulation Index [13], a graphical correction to the auditory AI was used when visual cues were present. This correction curve is shown in Figure 2. As indicated by the figure, the auditory-visual AI is simply a function of the auditory AI, regardless of any differences (spectral or temporal) that might exist among acoustic signals. Thus, for example, a calculated auditory AI of 0.2 would always be equivalent to an effective auditory-visual AI of 0.35. For low-context sentence materials, this effective increase in AI translates to an increase in intelligibility from roughly 50% words correct to 90% words correct [14].
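Computationally, the 1969 correction amounts to a one-dimensional lookup: the effective auditory-visual AI is a function of the calculated auditory AI and nothing else. The sketch below makes that explicit; apart from the (0.2, 0.35) pair cited above, the anchor values are placeholders chosen for illustration, not values read from the standard.

```python
import numpy as np

# Hypothetical digitization of the ANSI S3.5-1969 correction curve
# (Figure 2). Only the (0.2, 0.35) point is given in the text; the
# remaining anchors are illustrative placeholders.
CALCULATED_AI   = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
EFFECTIVE_AV_AI = np.array([0.10, 0.35, 0.55, 0.70, 0.85, 1.00])

def effective_av_ai(auditory_ai):
    """Effective AI with visual cues, by interpolation on the curve.
    Note the assumption under test: the result depends only on the
    auditory AI value, never on the spectral content behind it."""
    return float(np.interp(auditory_ai, CALCULATED_AI, EFFECTIVE_AV_AI))

print(effective_av_ai(0.2))  # 0.35
```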

The overarching assumption of the ANSI 1969 AI standard vis-à-vis auditory-visual speech recognition is that visual cues benefit speech intelligibility equally regardless of the spectral content of the acoustic speech signal. This assumption was tested directly by Grant and Walden [15]. The auditory conditions consisted of /a/-consonant-/a/ (aCa) tokens processed through twelve different bandpass filters of varying bandwidth and center frequencies. The results showed that there was little relationship between overall auditory intelligibility and overall auditory-visual intelligibility (Figure 3). Further inspection of these data revealed that low-frequency bands (e.g., 250-505 Hz) tended to provide much more benefit to speechreading than mid-frequency (e.g., 2600-3255 Hz) or high-frequency (e.g., 4200-5720 Hz) bands. An information analysis [16] of the consonant error patterns showed that the information conveyed by speechreading alone was almost completely restricted to place of articulation (i.e., little or no transmission of voicing or manner-of-articulation information). Furthermore, auditory bands which conveyed a relatively high degree of place-of-articulation information provided the least amount of benefit when combined with speechreading. In other words, when the auditory and visual channels contained similar articulatory feature information (i.e., the two channels were mostly redundant with respect to each other), little auditory-visual benefit was obtained. In contrast, when the auditory channel conveyed a relatively high degree of information about consonant voicing and consonant manner of articulation (i.e., complementary information relative to speechreading), the auditory-visual benefit was very high. Thus, in direct contradiction to the assumptions made in the ANSI Standard, different auditory conditions with the same AI need not have the same auditory-visual AI. These findings are consistent with various models of auditory-visual integration [17, 18, 19] and demonstrate that, at least for consonants, the amount of benefit provided by combining auditory and visual speech cues is determined primarily by the degree of articulatory-feature redundancy between the two channels.
Figure 2. ANSI 1969 correction curve for estimating the effective Articulation Index (AI) with visual cues. Reprinted from [13].

[Figure 2 axes: Calculated AI (abscissa) vs. Effective AI With Visual Cues (ordinate); both from 0 to 1.]


The data from Grant and Walden [15] strongly suggest that visual speech cues have a weighted influence on the perception of auditory cues, depending on the spectral content of the acoustic speech signal. Moreover, the amount of benefit provided by visual speech cues for nonsense syllable recognition can be predicted fairly accurately by determining the degree of complementarity between the auditory and visual channels [14, 17, 18, 19].
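The feature-information analysis referred to above [16] can be summarized compactly: collapse a consonant confusion matrix onto a feature (voicing, manner, or place) and compute the proportion of that feature's entropy actually transmitted. The sketch below is one way to write this; the function name and data layout are mine, not those of [16].

```python
import numpy as np

def relative_transmitted_info(confusions, feature):
    """Proportion of a feature's entropy transmitted, in the spirit
    of the Miller and Nicely analysis [16].

    confusions[i, j]: count of stimulus consonant i heard as j.
    feature[i]: feature class (e.g., 0/1 for voicing) of consonant i.
    """
    classes = sorted(set(feature))
    idx = [classes.index(f) for f in feature]
    m = np.zeros((len(classes), len(classes)))
    for i, row in enumerate(confusions):          # collapse consonants
        for j, count in enumerate(row):           # onto feature classes
            m[idx[i], idx[j]] += count
    p = m / m.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mutual = (p[nz] * np.log2(p[nz] / np.outer(px, py)[nz])).sum()
    entropy = -(px[px > 0] * np.log2(px[px > 0])).sum()
    return mutual / entropy
```

In these terms, an auditory band is redundant with speechreading when its place transmission is high but its voicing and manner transmission are low, and such a band would be expected to add little auditory-visual benefit.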

In response to these observations, the revised standard for calculating the Articulation Index (referred to as the Speech Intelligibility Index, or SII) has removed the graphical auditory-visual correction procedure, and limited the scope to conditions that do not include multiple, sharply filtered bands of speech, sharply filtered noise, or acoustic signals which are not typical of normal speech (e.g., sine wave speech).

1.4. Auditory-visual interactions in time and frequency

The studies described above have helped improve our understanding of some of the various perceptual factors involved in auditory-visual speech recognition, while at the same time exposing certain weaknesses and limitations in models of speech intelligibility in general. However, they do not speak directly to the mechanisms or processes that are used during auditory-visual speech recognition. Specifically, how do listeners relate what they see on the lips to what they hear? Summerfield [20] hypothesized three possible roles for visual cues in improving speech understanding in noise. The two most apparent of these are to provide segmental (e.g., consonants and vowels) and suprasegmental (e.g., intonation, stress, rhythmic patterning, etc.) information which is 1) redundant to cues provided acoustically, and 2) complementary to cues provided acoustically. As already discussed, the greatest benefits occur when speechreading and audition provide complementary feature information. In noisy and reverberant environments, or for individuals with hearing impairment, many of the relevant acoustic attributes that lead to the identification of phonetic units may be very weak, absent, or distorted [21]. Under these conditions, there is significant ambiguity in the auditory channel, in particular with regard to place of articulation. When audition and speechreading are combined, however, a substantial proportion of place cues are restored through speechreading and the integrated auditory-visual percept is far more complete than that obtained from either of the unimodal sources alone.

The third role of speechreading hypothesized by Summerfield [20] pertains to the spectro-temporal relations that exist between visible movements of a speaker's articulators and the acoustic speech signal. When a listener watches a talker speak, the acoustic signal and the visible movements of the talker's lips share common spatial, temporal, and spectral properties which help segregate the speech signal of interest from the surrounding background noise. Direct measurements of the displacements of the upper and lower inner margins of the lips at midline, or of the area of lip opening, have been shown to be related to the overall amplitude contour of the speech signal [22]. Further measures have shown that the correlation between the area of lip opening and acoustic envelope dynamics also depends on the spectral region of the acoustic signal, with the highest correlation observed for acoustic signals with energy concentrated in the region of the second and third formant frequencies [23].
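A band-specific correlation of this kind is straightforward to estimate. The sketch below is a rough reconstruction rather than the analysis code of [22, 23]: it bandpass-filters the speech, extracts the Hilbert envelope, resamples it to the video frame rate, and correlates it with the lip-area series. The 30-Hz frame rate and the filter order are assumptions; band edges such as 800-2200 Hz (the F2 region discussed below) would be passed as parameters.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample

def band_envelope(speech, fs, lo_hz, hi_hz, frame_rate):
    """Amplitude envelope of speech restricted to [lo_hz, hi_hz],
    resampled to the video frame rate for comparison with lip area."""
    sos = butter(4, [lo_hz, hi_hz], btype="band", fs=fs, output="sos")
    env = np.abs(hilbert(sosfiltfilt(sos, speech)))
    return resample(env, int(len(speech) / fs * frame_rate))

def lip_envelope_correlation(lip_area, speech, fs, lo_hz, hi_hz,
                             frame_rate=30.0):
    """Pearson correlation between the lip-area time series and the
    band-specific acoustic envelope (cf. [22, 23])."""
    env = band_envelope(speech, fs, lo_hz, hi_hz, frame_rate)
    n = min(len(env), len(lip_area))
    return float(np.corrcoef(lip_area[:n], env[:n])[0, 1])
```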

One psychophysical consequence of this relation between speech kinematics and spectrally specific acoustic speech envelopes is that when visual speech cues are present there is a reduction in the spectral and temporal uncertainty associated with the onset of syllables and words. Recent studies [22, 23] have shown that this reduction in uncertainty can lead to improved speech detection thresholds in noise, through a process called bimodal coherence masking protection (BCMP). In other words, watching the movements of the lips during speech production can inform the listener not only where in space and when in time to listen to prominent acoustic events, but also where in the acoustic spectrum to expect the events to occur.


The experimental paradigm used in these studies was a variant of the comodulation masking release paradigm [24]. The primary goal was to determine if comodulated activity between orofacial kinematics and acoustic amplitude envelope led to an improvement in speech detection thresholds. Thresholds for detecting sentences in noise were determined under a variety of conditions: auditory alone, auditory-visual with matching (congruent) video, filtered auditory-visual speech with congruent video, and auditory-visual with unmatched (incongruent) video. For each condition, the degree of correlation between area of mouth opening and auditory envelope fluctuations was determined. In addition, a control condition using visual orthography to indicate the text of the target audio sentence prior to each test trial was tested.

As seen in Figure 4, a significant masking release for detecting spoken sentences (1-3 dB depending on the specific target audio sentence) was observed when simultaneous visual speech information was provided. There was no effect on auditory masking when mismatched (incongruent) visual speech information was provided (not shown). Informing subjects as to the identity of the target audio sentence using an orthographic display resulted in a small release from masking, independent of the target sentence, probably reflecting a general reduction in stimulus uncertainty. Results for filtered-speech targets corresponding roughly to the first (100-800 Hz) and second (800-2200 Hz) formant-frequency regions showed that mid-frequency speech targets produced a masking release equivalent to that of broadband unprocessed speech, and low-frequency speech targets produced significantly smaller amounts of masking release.

These results suggest that the visible modulations of the lips and jaw during speechreading make auditory detection of speech easier by informing listeners about the probable spectro-temporal structure of a near-threshold acoustic speech signal. A correlation analysis of the area of opening of the lips and the acoustic envelope modulations of the co-occurring sentence revealed a predictive relationship between the degree of masking release and the strength of the area function/acoustic envelope correlation. Specifically, sentences with a high correlation between area function and acoustic envelope showed more masking release than sentences with a lower correlation. Furthermore, measures between the area of lip opening and acoustic envelope modulations extracted from specific spectral regions of speech showed that a higher correlation can be expected for acoustic energy modulations in the second (F2) and third (F3) formant regions. This correspondence is exactly what one might predict given speechreaders' abilities to extract primarily place-of-articulation information. It is well established that auditory place-of-articulation information is conveyed by cues contained primarily in the mid-to-high frequency regions (i.e., F2 and F3 regions of speech).

2. Discussion and Conclusion

Viewed collectively, studies describing the benefits of auditory-visual speech recognition using minimally intelligible acoustic signals, and studies of the relations between speech acoustic and articulatory dynamics, suggest a two-tiered approach towards modelling auditory-visual integration.
Figure 4. Masked threshold differences, or masking release (in dB), for auditory and matching auditory-visual conditions. AVWB = wideband speech; AVF2 and AVF1 = bandpass-filtered speech (see text); AVO = orthographically cued speech. Adapted from [22, 23].


On the one hand, linguistic information derived from auditory and visual speech processes combines synergistically, such that information disrupted or lost in one channel may be recovered by the other channel [17, 18, 19]. The best example of this pertains to place-of-articulation information, which is extremely vulnerable acoustically to noise, reverberation, and hearing loss. Place cues, however, are fairly robust visually and are minimally affected, if at all, by noise, reverberation, and hearing loss. Naturally, visual place cues are affected by other environmental factors such as lighting and viewing angle, but these factors are relatively unimportant to auditory processing. This complementary arrangement of acoustic and visual speech cues makes for a remarkably robust signal and forms the basis of most current models of auditory-visual integration.

A second important way that visual speech interacts with auditory speech processing is through the correlated activity between movements of the face during speech production and various aspects of the speech amplitude envelope. This information can be used by listeners to influence low-level auditory processing of speech. The visible movements of orofacial structures during speech production inform listeners about when (in time) to expect peak amplitudes in the acoustic waveform, and where (in the acoustic frequency spectrum) to expect these peaks to occur. Thus, by watching the face while listening to speech, there is a significant reduction in signal uncertainty that enables listeners to extract signals from noise at S/N ratios that otherwise would be below threshold.

By considering the physical coherence between the two sources of information, in addition to their respective linguistic content, it becomes possible to couch some of the benefits of speechreading in auditory-visual speech processing in terms of the activity of populations of multisensory neurons having particular optimal stimulus onset asynchronies. For example, enhanced physiologic responses of multisensory neurons presumably translate to increased reaction speeds of superior colliculus-mediated attentive and orientation responses [25]. These enhanced levels of neurologic activity may provide greater overall drive to higher-level auditory neurons, which in turn allow for reduced speech detection thresholds. Of course, at this point, these are only speculations. But the latest psychophysical results on bimodal coherence masking protection are at least consistent with recent physiological findings. In addition, this somewhat more physical interpretation of how speechread cues can be used to guide auditory analysis of speech suggests new strategies for signal processing in the areas of automatic speech recognition and automatic noise reduction in hearing aid design. For instance, it may be possible to use the correlated activity between optics and acoustics in speech production to fashion temporal filters that can be used to effectively segregate target speech components from interfering background noise or other talkers.
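As a toy illustration of what such a "temporal filter" might look like, the sketch below applies a gain derived from the normalized lip-area envelope to a noisy acoustic band envelope, passing energy when the mouth is moving and attenuating it otherwise. Both envelopes are assumed to be time-aligned at a common rate; this is a deliberately crude placeholder for the proposed strategy, not a description of any implemented system.

```python
import numpy as np

def visually_guided_gain(noisy_band_env, lip_area_env, floor=0.2):
    """Attenuate a noisy acoustic band envelope except where the
    synchronized lip-area envelope indicates articulatory activity.
    `floor` keeps a minimum gain so the band is shaped, not gated."""
    gain = lip_area_env / (lip_area_env.max() + 1e-12)  # normalize to [0, 1]
    return noisy_band_env * np.maximum(gain, floor)
```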
Further, the fact that the visible movements of the lips operate on a time frame consistent with slow-rate acoustic modulations in the range between 0 and 30 Hz suggests a more unified approach to modelling auditory-visual speech processing. The basic idea of this approach is to treat the visual speech signal as an additional channel of amplitude modulation that can augment and guide auditory modulation analysis of speech. In other words, an auditory-visual modulation spectrum could be derived and interpreted in much the same way as auditory modulation patterns are currently interpreted within models such as the Speech Transmission Index [12]. There is a growing literature demonstrating that speech intelligibility is critically dependent on the preservation of these slow-rate, spectro-temporal amplitude modulations, reflecting the dynamic movement of the speech articulators [26, 27] as well as variations in syllable and phonetic duration observed in conversational speech [28].
Because the visual channel can serve as another source of this critical information, one that is relatively immune to environmental noise and reverberation, it should prove invaluable in a host of applications from models of speech intelligibility to automatic speech recognition.


Exactly how best to use this information will require further work aimed at delineating the precise relations between acoustic and visual modulation spectra and the extent to which this information is spectrally specific. Work along these lines is currently underway.
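One way to make the notion of an auditory-visual modulation spectrum concrete is to compute the same low-rate magnitude spectrum for an acoustic band envelope and for the lip-area track, and compare them on a common modulation-frequency axis. The sketch below is one plausible formulation under that assumption, not a method taken from [12] or from the studies above.

```python
import numpy as np

def modulation_spectrum(envelope, rate, max_mod_hz=30.0):
    """Magnitude spectrum of an envelope (acoustic band or lip area),
    kept below max_mod_hz, the slow modulation range identified
    above as critical for intelligibility."""
    env = np.asarray(envelope, dtype=float)
    env = env - env.mean()                          # drop the DC term
    spec = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / rate)
    keep = freqs <= max_mod_hz
    return freqs[keep], spec[keep]
```

Applied to a lip-area track sampled at the video frame rate, the same function yields a visual modulation spectrum directly comparable to the acoustic one, since both signals live below roughly 30 Hz.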

3. Acknowledgments

This research was supported by the Clinical Investigation Service, Walter Reed Army Medical Center, under Work Unit #2508, and by grant numbers DC 000792-01A1 from the National Institute on Deafness and Other Communication Disorders to Walter Reed Army Medical Center and SBR 9720398 from the Learning and Intelligent Systems Initiative of the National Science Foundation to the International Computer Science Institute. All subjects participating in this research provided written informed consent prior to beginning any of the described studies. I would like to thank Dr. Jennifer Tufts for her helpful comments on an earlier draft of this paper. The opinions or assertions contained herein are the private views of the author and are not to be construed as official or as reflecting the views of the Department of the Army or the Department of Defense.

References

[1] Sumby, W.H., and Pollack, I. (1954). "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Am. 26, 212-215.

[2] Grant, K.W., Ardell, L.H., Kuhl, P.K., and Sparks, D.W. (1985). "The contribution of fundamental frequency, amplitude envelope, and voicing duration cues to speechreading in normal-hearing subjects," J. Acoust. Soc. Am. 77, 671-677.

[3] Grant, K.W., Ardell, L.H., Kuhl, P.K., and Sparks, D.W. (1986). "The transmission of prosodic information via an electrotactile speechreading aid," Ear and Hearing 7, 328-335.

[4] Mazeas, R. (1968). "Hearing capacity, its measurement and calculation," Amer. Annals of the Deaf 113, 268-274.

[5] Rosen, S.M., Fourcin, A.J., and Moore, B.C.J. (1981). "Voice pitch as an aid to lipreading," Nature 291 (5811), 150-152.

[6] Breeuwer, M., and Plomp, R. (1984). "Speechreading supplemented with frequency-selective sound-pressure information," J. Acoust. Soc. Am. 76, 686-691.

[7] DeFilippo, C.L., and Scott, B.L. (1978). "A method for training and evaluating the reception of ongoing speech," J. Acoust. Soc. Am. 63, 1186-1192.

[8] Grant, K.W., Braida, L.D., and Renn, R.J. (1991). "Single-band amplitude envelope cues as an aid to speechreading," Quart. J. Exp. Psych. 43, 621-645.

[9] Grant, K.W., Braida, L.D., and Renn, R.J. (1994). "Auditory supplements to speechreading: Combining amplitude envelope cues from different spectral regions of speech," J. Acoust. Soc. Am. 95, 1065-1073.

[10] French, N.R., and Steinberg, J.C. (1947). "Factors governing the intelligibility of speech sounds," J. Acoust. Soc. Am. 19, 90-119.

[11] Fletcher, H., and Galt, R.H. (1950). "The perception of speech and its relation to telephony," J. Acoust. Soc. Am. 22, 89-150.

[12] Houtgast, T., and Steeneken, H.J.M. (1980). "Predicting speech intelligibility in rooms from the modulation transfer function. I. General room acoustics," Acustica 46, 60-72.

[13] American National Standards Institute (1969). "American National Standard Methods for the Calculation of the Articulation Index," ANSI S3.5-1969, American National Standards Institute, New York.


[14] Grant, K.W., and Braida, L.D. (1991). "Evaluating the Articulation Index for audiovisual input," J. Acoust. Soc. Am. 89, 2952-2960.

[15] Grant, K.W., and Walden, B.E. (1996). "Evaluating the articulation index for auditory-visual consonant recognition," J. Acoust. Soc. Am. 100, 2415-2424.

[16] Miller, G.A., and Nicely, P.E. (1955). "An analysis of perceptual confusions among some English consonants," J. Acoust. Soc. Am. 27, 338-352.

[17] Massaro, D.W. (1987). Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry. Hillsdale, NJ: Lawrence Erlbaum Assoc.

[18] Massaro, D.W. (1998). Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. Cambridge, MA: MIT Press.

[19] Braida, L.D. (1991). "Crossmodal integration in the identification of consonant segments," Quart. J. Exp. Psych. 43, 647-677.

[20] Summerfield, Q. (1987). "Some preliminaries to a comprehensive account of audio-visual speech perception," in B. Dodd and R. Campbell (Eds.), Hearing by Eye: The Psychology of Lip-Reading. Hillsdale, NJ: Lawrence Erlbaum Associates, 3-52.

[21] Lindblom, B. (1996). "Role of articulation in speech perception: Clues from production," J. Acoust. Soc. Am. 99, 1683-1692.

[22] Grant, K.W., and Seitz, P.F. (2000). "The use of visible speech cues for improving auditory detection of spoken sentences," J. Acoust. Soc. Am. 108, 1197-1208.

[23] Grant, K.W. (2001). "The effect of speechreading on masked detection thresholds for filtered speech," J. Acoust. Soc. Am. 109, 2272-2275.

[24] Hall, J.W., Haggard, M.P., and Fernandes, M.A. (1984). "Detection in noise by spectro-temporal pattern analysis," J. Acoust. Soc. Am. 76, 50-56.

[25] Meredith, M.A., and Stein, B.E. (1996). "Spatial determinants of multisensory integration in cat superior colliculus," J. Neurophysiol. 75, 1843-1857.

[26] Drullman, R., Festen, J., and Plomp, R. (1994). "Effect of envelope smearing on speech perception," J. Acoust. Soc. Am. 95, 1053-1064.

[27] Arai, T., Pavel, M., Hermansky, H., and Avendano, C. (1996). "Intelligibility of speech with filtered time trajectories of spectral envelopes," Proc. ICSLP, 2490-2492.

[28] Greenberg, S. (1997). "On the origins of speech intelligibility in the real world," Proc. ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, 23-32.