SPEAKER DEPENDENT SPEECH RECOGNITION BASED ON PHONE-LIKE UNITS MODELS - APPLICATION TO VOICE DIALING

Vincent Fontaine and Hervé Bourlard
Faculté Polytechnique de Mons, […] Boulevard Dolez, B-[…] Mons, BELGIUM
Tel: […]  Fax: […]
Email: {fontaine,bourlard}@tcts.fpms.ac.be

ABSTRACT

This paper presents a speaker dependent speech recognition system with application to voice dialing. This work has been developed under the constraints imposed by voice dialing applications, i.e., low memory requirements and limited training material. Two methods for producing speaker dependent word baseforms based on Phone Like Units (PLU) are presented and compared: (1) a classical vector quantizer is used to divide the space into regions associated with PLUs; (2) a speaker independent hybrid HMM/MLP recognizer is used to generate speaker dependent PLU-based models. This work shows that very low error rates can be achieved even with very simple systems, namely a DTW-based recognizer. However, the best results are achieved when using the hybrid HMM/MLP system to generate the word baseforms. Finally, a real-time demonstration simulating voice dialing functions and including keyword spotting and rejection capabilities has been set up and can be tested online.

1. INTRODUCTION

Voice dialing is typically based on speaker dependent speech recognition systems in which each speaker can easily define his/her own personal repertory containing the set of commands or keywords that will be used later on to automatically dial phone numbers. The setup of such a system is usually based on two phases:

• Enrollment phase: The user pronounces each of the keywords several times (twice in our case) and provides the system with the associated phone number. Ideally, this enrollment should be as fast and flexible as possible.
• Recognition phase: The user pronounces a keyword and the system automatically dials the associated phone number. Furthermore, if several speakers belong to the same directory, the system should also be able to identify the speaker in the case of similar keywords.

One simple solution towards fast enrollment is based on standard template matching approaches (simply storing sequences of acoustic vectors associated with each utterance) and dynamic time warping (DTW). This approach, however, suffers from major drawbacks, namely high memory storage requirements and poor robustness against the variability of the test conditions.
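
As a rough illustration of the template-matching baseline described above, the following Python sketch stores one sequence of acoustic vectors per keyword and picks the keyword with the lowest DTW cost. The names `dtw_distance` and `recognize`, the squared Euclidean local distance, and the unweighted horizontal/vertical/diagonal step pattern are assumptions made for the example; the paper does not specify these details.

```python
import math

def sq_euclid(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dtw_distance(test, template, local=sq_euclid):
    """Accumulated DTW cost between two sequences of acoustic vectors.

    Assumed step pattern: (i-1, j), (i, j-1), (i-1, j-1), no slope weights.
    """
    n, m = len(test), len(template)
    # cost[i][j] = best accumulated cost aligning test[:i] with template[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local(test[i - 1], template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(test, templates):
    """Return the keyword whose stored template has the lowest DTW cost."""
    return min(templates, key=lambda kw: dtw_distance(test, templates[kw]))
```

Every template keeps one full vector per frame, which is the memory cost that the PLU-based word models are meant to avoid.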

Alternative solutions to the straightforward DTW approach have been proposed in the past. In [4], HMMs are automatically derived from the keywords pronounced by the user during the enrollment phase. The training of such models, however, requires a large number of examples, which makes the system less flexible and less attractive to the user. Another solution [6], somewhat related to what will be investigated in the current paper, is to use the symbolic string produced by a speaker independent speech recognizer to represent the keyword. Compared to DTW, this leads to nearly equivalent recognition rates with the advantage of a drastic reduction of the memory requirements.

In this paper, two methods for automatically generating speaker specific models based on phone-like units (PLU) are tested:

• Section 3 makes use of a standard vector quantizer to design the PLUs.
• Section 4 uses a speaker independent hybrid hidden Markov model (HMM) / multilayer perceptron (MLP) system to generate PLU-based speaker dependent models.

Although some of the approaches used in this paper have already been investigated in the past (see, e.g., [2] in the case of DTW with vector quantization), they are now tested in the particular framework of a voice dialing application and include (1) state-of-the-art acoustic features (see Section 2) and (2) keyword spotting capabilities (see Section 5).

2. ACOUSTIC FEATURES AND DATABASES

All the experiments reported in this paper used […] RASTA-PLP cepstral coefficients [5] extracted from […] ms speech frames shifted by […] ms. The cepstral coefficients were liftered by a sine window

$$w(i) = 1 + \frac{N}{2} \sin\left(\frac{\pi i}{N}\right)$$

with N = […] in our case. As discussed at the end of Section 3, the use of delta features did not seem to improve performance in the case of the particular models used here.
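
The liftering step can be sketched as follows, assuming the common raised-sine lifter w(i) = 1 + (N/2) sin(πi/N) applied to the coefficients c_1, c_2, ... of one frame. The function names and the value N = 12 used in the usage note are hypothetical; the value of N used in the paper is not recoverable from this text.

```python
import math

def sine_lifter(num_coeffs, N):
    """Weights w(i) = 1 + (N/2) * sin(pi * i / N) for i = 1..num_coeffs."""
    return [1.0 + (N / 2.0) * math.sin(math.pi * i / N)
            for i in range(1, num_coeffs + 1)]

def lifter(cepstrum, N):
    """Apply the sine lifter to one cepstral vector c_1..c_p (c_0 excluded)."""
    weights = sine_lifter(len(cepstrum), N)
    return [w * c for w, c in zip(weights, cepstrum)]
```

With the hypothetical N = 12, the weight peaks at i = 6, where w(6) = 1 + 6 = 7, and returns to about 1 at i = 12, so mid-order coefficients are emphasized relative to the lowest and highest ones.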

Due to the lack of appropriate databases to study voice dialing systems, we decided to test our algorithms on the BDSONS database, designed for speaker dependent speech recognition and consisting of the 10 isolated French digits pronounced […] times by […] speakers ([…] digit utterances/speaker). For each speaker, […] utterances (the first two utterances of each digit) were retained to create the word models, i.e., to simulate fast enrollment of new keywords. In this paper, this database will be referred to as the "enrollment database". The […] remaining digit utterances were used for testing¹ and are referred to as the "test database".

Since voice dialing systems should ideally be independent of the language and of the training database, we decided to train the codebooks or the neural network on the TIMIT database, i.e., a database recorded in another language and designed for another speech recognition task. In this paper, this database will be referred to as the "training database".

3. DYNAMIC PROGRAMMING AND VECTOR QUANTIZATION

The first kind of phone-like units that were tested were built up from a standard K-means vector quantizer (based on standard Euclidean distances²) dividing the acoustic parameter space into regions representing PLUs. In this case, a K-means clustering was applied to the whole set of the TIMIT acoustic vectors (training database).
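
A minimal sketch of how such a codebook could be built is given below, using plain Lloyd iterations with a squared Euclidean distance. The random initialization from the data, the fixed iteration count, and the function names `kmeans` and `nearest` are assumptions for illustration; the paper only states that a standard K-means quantizer was used.

```python
import random

def sq_euclid(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centroids):
    """Index (PLU label) of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda k: sq_euclid(vec, centroids[k]))

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns a codebook of k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centroids)].append(v)
        for idx, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                dim = len(members[0])
                centroids[idx] = [sum(m[d] for m in members) / len(members)
                                  for d in range(dim)]
    return centroids
```

Each centroid index then plays the role of one PLU label, and `nearest` gives the label of the closest prototype for any acoustic vector.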

During enrollment (transforming the enrollment utterances into sequences of PLUs), PLU-based word models are built up by first replacing each vector of the enrollment utterances by the label of the closest prototype. The resulting label sequences are then further processed according to the following simple rule: sequences of the same label are reduced to sequences of length n (indicating stationary parts of speech), while transition parts are left unchanged. For example, if we suppose n = […], the label sequence {[…]} will be turned into {[…]}. The parameter n will be referred to as the sequence compression factor in the sequel. The resulting compressed label sequence was stored as the word model, resulting in a significant reduction of the memory requirements compared to storing the acoustic vector sequences, with an average of […] to […] bytes per word. Another consequence of the sequence compression procedure is a gain in CPU time for the dynamic programming that is proportional to the storage gain. Also, as already shown in [2], this kind of modelling has a smoothing effect over time and frequency that can result in slightly better recognition performance. The enrollment procedure is illustrated in Figure 1.
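
The compression rule can be sketched in a few lines of Python. This follows one reading of the rule, in which every run of identical labels longer than n is cut down to n labels and shorter runs (transitions) are kept as they are. The function name `compress` and the label values in the usage note are made up for the example.

```python
from itertools import groupby

def compress(labels, n):
    """Cap every run of identical labels at length n; shorter runs are kept."""
    out = []
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        out.extend([label] * min(length, n))
    return out
```

For instance, `compress([7, 7, 7, 7, 3, 9, 9, 9], 2)` returns `[7, 7, 3, 9, 9]`: the two stationary runs shrink to two labels each, while the single transition label 3 is untouched.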

The speaker dependent character of the models gives the system the ability to discriminate keywords pronounced by different speakers. This case is typically encountered when two persons introduce the same keyword (e.g., "Mom") in the same enrollment database but with different phone numbers associated with this keyword.

During recognition, and unlike some methods proposed in the past (and unlike discrete HMMs), the input vectors are not quantized³; it is indeed not necessary to perform vector quantization during the recognition phase, since it will only introduce unnecessary computation and distortion. Instead, local distances of the dynamic programming grid are computed between the test vectors and the centroids corresponding to the labels of the models.

Acoustic Vectors → V.Q. → Label Sequence → Sequence Compression → Word Model
Figure 1: Enrollment procedure. Acoustic vectors are first quantized to produce label sequences that are further processed to produce word models.

¹ We note here that, apart from the number of active words, this task could be harder than voice dialing applications since (1) keywords are quite short and (2) some of them are quite confusable, like "cinq" and "sept".
² Mahalanobis distances were also tested but never led to significant improvements.
³ This has been tested and has been shown to lead to significantly lower performance.
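
The recognition step, in which unquantized test vectors are compared with the centroids named by the word model's labels, might be sketched as follows. The step pattern, the squared Euclidean local distance, and the names `dtw_to_model` and `recognize` are assumptions; the codebook and word models in the test data are toy values.

```python
import math

def sq_euclid(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dtw_to_model(test, model_labels, codebook):
    """DTW cost between unquantized test vectors and a PLU label sequence.

    The local distance at grid point (i, j) is taken between test vector i
    and the centroid of the j-th label of the word model.
    """
    n, m = len(test), len(model_labels)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sq_euclid(test[i - 1], codebook[model_labels[j - 1]])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(test, word_models, codebook):
    """Return the keyword whose label sequence gives the lowest DTW cost."""
    return min(word_models,
               key=lambda kw: dtw_to_model(test, word_models[kw], codebook))
```

Only the label sequences and the shared codebook need to be stored, and the test utterance itself is never replaced by labels.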

The results presented in Table 1 show that very high accuracy has been obtained for this task, especially when the codebook is designed with […] centroids. We tried to improve these results by adding the first derivatives of the cepstral coefficients and the first and second derivatives of the log-energy. These additional parameters were quantized by separate codebooks ([…] codebooks in total) and the training utterances were then modeled by sequences of labels. The local distances of the dynamic programming grid were computed as a weighted sum of the distances between the vector components and the nearest centroid of the corresponding codebook. Several weighting configurations of the distances were tested but never led to a significant improvement over the results obtained with static parameters only. A possible explanation is that the word models are so detailed that they implicitly include a good description of the dynamics.

Tests in a multispeaker environment were also performed to study the ability of the system to discriminate between speakers. The models of the ten digits for […] speakers were considered as the enrollment database and the test set was composed of the remaining utterances for these speakers. Table 1 shows that discrimination between speakers (and keywords) becomes quite good as the size of the codebook increases. This is not surprising, since the more refined the PLU space becomes, the more the speaker specific characteristics are captured by the system.

Table 2 presents the influence of the compression of the label sequences on the error rate. We can observe that the best results are obtained without compression, but that the error rate is not very sensitive to the compression parameter. Even for n = […], the error rate is only […], to be compared to […] without compression, while storage and computation requirements are reduced by approximately […].

#Classes    Monospeaker    Multispeaker
[…]

Table 1: Influence of the number of VQ classes on the error rate. In the monospeaker case, the tests have been performed independently for each of the […] speakers (the enrollment set only contained models of the tested speaker) and the results have been averaged. In the multispeaker case, models of […] speakers were recognized simultaneously.

Compression factor    Error Rate
[…]

Table 2: Influence of the compression factor applied to stationary parts of label sequences. The tests have been performed in the monospeaker task. A compression factor of […] leaves the label sequences unchanged.

4. HYBRID HMM/MLP VOICE DIALING

The approach discussed now follows the same principle as the method presented in Section 3, with the difference that the unsupervised K-means clustering is replaced by the supervised training of a multilayer perceptron (MLP), as used in the framework of speaker independent hybrid HMM/MLP systems [3].

As in hybrid HMM/MLP systems, the MLP network is trained in a supervised way (possibly within embedded Viterbi) to yield posterior probabilities of phone classes (associated with the MLP outputs) conditioned on the input vectors presented to the network. This training is done in a speaker independent mode (on TIMIT in the current work). In the current system though, as opposed to standard HMM/MLP recognizers, the trained network is then used for two different goals:

• To automatically infer the model topology (in terms of PLU sequence) of the voice dialing keywords.
• To compute local DTW distances between the test utterance and the inferred models.

The sequence of PLUs associated with each enrollment utterance was then generated in two steps: (1) replacing each frame by the label of the phonemic class associated with the highest posterior probability observed on the MLP outputs and (2) applying the time compression scheme as used in Section 3. The enrollment procedure as applied for hybrid voice dialing is illustrated in Figure 2.
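
The two enrollment steps for the hybrid system can be sketched as below, assuming the MLP outputs for an utterance are already available as one posterior vector per frame. The names `map_labels`, `compress` and `enroll` are hypothetical, and compression is read as capping each run of identical labels at n.

```python
from itertools import groupby

def map_labels(posteriors):
    """Label each frame with the index of its largest MLP output (MAP class)."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]

def compress(labels, n):
    """Cap every run of identical labels at length n; shorter runs are kept."""
    out = []
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        out.extend([label] * min(length, n))
    return out

def enroll(posteriors, n):
    """Build a PLU word model from per-frame posterior vectors."""
    return compress(map_labels(posteriors), n)
```

The MLP itself is not modelled here; any classifier that produces per-frame class posteriors could stand in for it in this sketch.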

Recognition was then performed by dynamic programming, where the local distances between each input vector x_n of the test utterance and the PLUs composing the reference words were defined as the Euclidean distance between the vector of a posteriori probabilities generated by the network for x_n and the vector of ideal a posteriori probabilities corresponding to the PLUs of the training utterances⁴.

Acoustic Vectors → M.L.P. → A Posteriori Probabilities → MAP Classification (Labeling) → Label Sequence → Sequence Compression → Word Model
Figure 2: Enrollment procedure. Acoustic vectors are labeled according to a MAP classification criterion. The label sequences are further processed to produce word models, as in the case of the VQ-based recognizer.

The vector of ideal a posteriori probabilities for the PLU q_i, noted d(q_i), corresponds actually to the desired outputs as presented to the network during its training phase:

$$d_k(q_i) = \delta_{ki}, \quad 1 \le k \le K$$

where K is the number of PLUs and \delta_{ki} is the usual Kronecker delta function, which is only nonzero (and equal to 1) when k = i. The local distance between x_n and the PLU q_i can then be expressed as

$$D(x_n, q_i) = \sum_{k=1}^{K} \left[ g_k(x_n) - d_k(q_i) \right]^2$$

where g(x_n) represents the output probabilities of the MLP.
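
The local distance between a frame and a PLU can be written directly from its definition as the sum over k of [g_k(x_n) − δ_ki]². In the sketch below, `posterior` stands for the MLP output vector g(x_n) and `plu` for the index i; both names are chosen for the example.

```python
def local_distance(posterior, plu):
    """Sum over k of (g_k - delta_ki)^2, with i = plu (0-based index)."""
    return sum((g - (1.0 if k == plu else 0.0)) ** 2
               for k, g in enumerate(posterior))
```

Expanding the square gives D = Σ_k g_k² − 2 g_i + 1, so for a given frame the ordering of the PLUs depends only on the posterior g_i of the PLU being scored: the higher that posterior, the smaller the distance.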

As in the previous section, the MLP was trained on the TIMIT database (English) and tested (with enrollment) on the BDSONS database containing French digits. Recognition results are presented in Table 3 and show that hybrid systems slightly outperform the best configuration of the DTW-based recognizer. These results also indicate that it is not necessary to divide the MLP output probabilities by the prior probabilities of the MLP output classes, as usually done in standard HMM/MLP systems. This can be explained by the fact that the word baseform topologies are directly inferred from the MLP.
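
For reference, the division by priors used in standard hybrid HMM/MLP systems turns each posterior P(q_k | x) into a scaled likelihood P(q_k | x) / P(q_k). The sketch below shows that operation only. Estimating the priors as relative label frequencies and the function names are assumptions, and how the divided outputs entered the distance computation in the experiments is not stated in the text.

```python
def estimate_priors(label_sequence, num_classes):
    """Relative frequency of each class in a sequence of training labels."""
    counts = [0] * num_classes
    for lab in label_sequence:
        counts[lab] += 1
    total = len(label_sequence)
    return [c / total for c in counts]

def scaled_likelihoods(posterior, priors):
    """Divide each MLP posterior P(q_k | x) by the class prior P(q_k)."""
    return [p / pr for p, pr in zip(posterior, priors)]
```

A frame whose posteriors equal the priors carries no evidence for any class, and its scaled likelihoods are all 1.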

Multispeaker tests have also been performed for the hybrid systems. The tests have been conducted in the same way as for the DTW-based recognizer and indicate that the hybrid system is not able to discriminate the speakers as well as VQ-derived PLUs. This can be explained by the fact that the MLP is trained in a supervised way to learn speaker independent realisations of phonemes. Therefore, the speaker dependent models of a keyword will be close to each other for all speakers.

⁴ Kullback-Leibler distance has also been tried but never outperformed results obtained with a Euclidean distance.

Network size (a, b, c)    […] speaker, div. by priors    […] speaker, no div. by priors    […] speakers, no div. by priors
[…]

Table 3: Speaker dependent recognition error rates with neural networks. (a, b, c) represent the number of input nodes, hidden nodes and output nodes, respectively. Feature vectors of the first network included dynamic parameters ([…] rasta, […] Δrasta, Δlog-energy, ΔΔlog-energy), while only static parameters ([…] rasta, log-energy) were used for the second network. In the monospeaker case ([…] speaker), the tests have been performed independently for each of the […] speakers (the enrollment set only contained models of the tested speaker) and the results have been averaged. In the multispeaker case ([…] speakers), models of […] speakers were recognized simultaneously.

Compression factor    Error Rate
[…]

Table 4: Influence of the compression factor applied to stationary parts of label sequences. The tests have been performed using the hybrid HMM/MLP system in the same conditions as for the DTW-based recognizer.

Table 4 shows the influence of the label sequence compression factor on the error rate. Here, it is quite remarkable (and particularly interesting) to observe that, unlike the previous DTW-based system, sequence compression always results in a significant reduction of the error rate. This makes it possible to significantly reduce memory storage (about […] storage gain) without any degradation of the error rate.

5. KEYWORD SPOTTING

The two algorithms discussed in this paper have been adapted to accommodate keyword spotting by using a slight adaptation of the method presented in [1] and referred to as "online garbage". In this case, a fictitious "garbage" unit is introduced in the dynamic programming, for which the local score is computed as the average of the N-best distances between each test frame and the reference labels. Keyword spotting is then performed simply by adding this garbage unit at the beginning and at the end of each word model (syntax allowing garbage-keyword-garbage, or garbage-garbage in the case of rejection).
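
A possible reading of the garbage mechanism is sketched below. It assumes that the local distances between every test frame and every PLU are precomputed in `test_dists[t][k]`, that each frame is assigned to exactly one unit, and that a path may either stay in a unit or move to the next one. These choices, the function names, and the option of an empty leading or trailing garbage segment are assumptions made for the example, not details given in the paper.

```python
import math

def garbage_score(frame_distances, n_best):
    """Average of the n_best smallest local distances for one test frame."""
    best = sorted(frame_distances)[:n_best]
    return sum(best) / len(best)

def keyword_cost(test_dists, model, n_best):
    """Best cost of the path garbage-keyword-garbage for one word model."""
    states = [None] + list(model) + [None]  # None marks the garbage unit

    def local(t, s):
        if states[s] is None:
            return garbage_score(test_dists[t], n_best)
        return test_dists[t][states[s]]

    num_states = len(states)
    prev = [math.inf] * num_states
    prev[0] = local(0, 0)
    prev[1] = local(0, 1)  # the leading garbage segment may be empty
    for t in range(1, len(test_dists)):
        cur = [math.inf] * num_states
        for s in range(num_states):
            best = prev[s] if s == 0 else min(prev[s], prev[s - 1])
            if not math.isinf(best):
                cur[s] = best + local(t, s)
        prev = cur
    return min(prev[-1], prev[-2])  # the trailing garbage segment may be empty

def garbage_only_cost(test_dists, n_best):
    """Cost of the pure garbage path, usable as a rejection reference."""
    return sum(garbage_score(d, n_best) for d in test_dists)
```

Under this sketch, an utterance could be rejected when no keyword path costs less than the pure garbage path. Because the garbage score is an average of the best local distances, it is never better than the best-matching unit, yet it stays cheap enough to absorb speech surrounding the keyword.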

6. CONCLUSION

In this paper, two approaches for speaker dependent speech recognition tasks based on the generation of phone-like unit sequences have been tested and compared. In both cases, very high accuracies can be achieved. A real-time demonstration of a voice dialing system based on the technology discussed in this paper, and including rejection and keyword spotting capabilities, has been implemented and can be tested at […] (Belgian site). Information on the use of the demonstration system is available on our web site at the address
http://tcts.fpms.ac.be/speech/softdial.html

REFERENCES

[1] J.-M. Boite, H. Bourlard, B. D'hoore, and M. Haesen, "A new approach towards keyword spotting," in Proceedings of EUROSPEECH […], pp. […].
[2] H. Bourlard, H. Ney, and C. J. Wellekens, "Connected Digit Recognition using Vector Quantization," IEEE Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. […].
[3] H. Bourlard and N. Morgan, Connectionist Speech Recognition, Kluwer Academic Publishers, […].
[4] D. Geller, R. Haeb-Umbach, and H. Ney, "Improvements in speech recognition for voice dialing in the car environment," in Proceedings of Speech Processing in Adverse Conditions, pp. […].
[5] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. on Speech and Audio Processing, vol. […], no. […], pp. […].
[6] N. Jain, R. Cole, and E. Barnard, "Creating speaker-specific phonetic templates with a speaker-independent phonetic recognizer: Implications for voice dialing," in Proceedings of ICASSP […], pp