PhD Three Month Report



1 Introduction


Automatic Speech Recognition (ASR) is mainly concerned with the automatic conversion of an acoustic speech signal into a text transcription of that speech. Many approaches to automatic speech recognition have been studied in the past, including approaches based on Artificial Intelligence (AI), pattern template matching techniques and Artificial Neural Networks (ANNs). Currently, the use of Hidden Markov Models (HMMs) is the most popular statistical approach to automatic speech recognition. At present, all state-of-the-art commercial and most laboratory speech recognition systems are based on the HMM approach.


Generally speaking, the goal of the first three months was to gain a comprehensive perspective on HMM-based speech recognition techniques, from conventional HMMs (Young 1995) to segment HMMs (Holmes, Russell 1999) and to the newly developed trajectory HMMs (Tokuda et al 2000). Moreover, some areas highly relevant to speech recognition, for example machine learning, have also been explored during the past three months. As far as this report is concerned, the aim is to provide a summary of the background reading and literature survey done so far. More importantly, an intended research question is proposed based on what has been reviewed on speech recognition. Finally, a plan and timetable for the next six months are presented at the end of the report.


2 Literature Review


2.1 HMM-based speech recognition system overview


The performance of automatic speech recognition systems has been greatly advanced as a result of the employment of Hidden Markov Models (HMMs) since the 1980s. HMMs model an utterance as a time-varying sequence of segments. All variations between different realizations of the segments, which are caused by different speaking styles, different speakers and so on, are normally modelled using multiple-component Gaussian mixtures. Figure 1 below shows a typical phoneme-level HMM-based large vocabulary continuous speech recognition system. It can be seen that the main components of a speech recognition system include the front-end processing module, the acoustic model, the language model and the Viterbi decoder.




[Figure omitted: the speech waveform passes through front-end processing to give the acoustic vector sequence $Y = \{y_1, y_2, \ldots, y_T\}$; the Viterbi decoder combines the acoustic model probability $P(Y \mid W)$ (a phoneme-level HMM store accessed via the pronunciation dictionary) with the language model probability $P(W)$ to output the optimal word sequence $w_1, w_2, \ldots, w_N$.]

Figure 1 Schematic diagram of a phoneme-level HMM-based speech recognition system


Any speech recognition system begins with an automatic transformation of the speech waveform into a phonetically meaningful, compatible and compact representation which is appropriate for recognition. This initial stage of speech recognition is referred to as front-end processing. For instance, the speech is first segmented into frames, usually 20-30 ms long with 10 ms overlap between adjacent segments. Each short speech segment is windowed and a Discrete Fourier Transform (DFT) is thereafter applied. The modulus and logarithm of the resultant complex spectral coefficients are taken to produce the log power spectrum. Mel frequency averaging is then applied, followed by a Discrete Cosine Transform (DCT), to produce a set of Mel Frequency Cepstral Coefficients (MFCCs). Alternative front-end analyses are also used; examples include linear prediction, Perceptual Linear Prediction (PLP) etc.
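
To make the pipeline above concrete, the following Python sketch walks through the same steps (framing, windowing, DFT, log power spectrum, mel averaging, DCT). The frame length, hop size, filterbank size and coefficient count are illustrative assumptions, not values taken from this report.

```python
import numpy as np

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: frame -> window -> DFT -> log power -> mel -> DCT."""
    # Split the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # DFT, then power spectrum (modulus squared of the complex coefficients).
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    n_bins = spec.shape[1]

    # Triangular mel filterbank (mel-frequency averaging of the power spectrum).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_bins - 1) * hz_pts / (fs / 2)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, mid, hi = bins[m], bins[m + 1], bins[m + 2]
        fbank[m, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[m, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_mel = np.log(spec @ fbank.T + 1e-10)   # log power in each mel band

    # DCT-II of the log mel energies yields the cepstral coefficients.
    t = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * t + 1)) / (2 * n_mels))
    return log_mel @ dct.T                      # one MFCC vector per frame
```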


MFCCs are widely used as the acoustic feature vectors in speech recognition. However, they do not account for the underlying speech dynamics which cause speech changes. A simple way to capture speech dynamics is to calculate the first and sometimes the second time derivatives of the static feature vectors and concatenate the static and dynamic feature vectors together. Although the use of dynamic features improves recognition performance, the independence assumption actually becomes even less valid, because the observed data for any one frame contribute to a time span of several feature vectors (Holmes, Russell 1999).
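
As a concrete illustration of this concatenation, the short sketch below appends first- and second-order dynamic features to a matrix of static vectors. The regression window of ±2 frames and the edge padding are assumptions for the example, not values from this report.

```python
import numpy as np

def add_deltas(static, L=2):
    """Append first and second time derivatives to static feature vectors.

    static: (T, D) array of e.g. MFCC vectors; returns a (T, 3*D) array.
    Deltas are regression-style weighted sums over +/- L neighbouring frames.
    """
    T, D = static.shape
    # Pad by repeating edge frames so every frame has a full window.
    padded = np.pad(static, ((L, L), (0, 0)), mode="edge")
    taus = np.arange(-L, L + 1)
    w = taus / np.sum(taus ** 2)                       # regression weights
    delta = sum(w[i] * padded[L + t: L + t + T] for i, t in enumerate(taus))
    padded2 = np.pad(delta, ((L, L), (0, 0)), mode="edge")
    delta2 = sum(w[i] * padded2[L + t: L + t + T] for i, t in enumerate(taus))
    return np.hstack([static, delta, delta2])
```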



Given a sequence of acoustic vectors $y = y_1, y_2, \ldots, y_T$, the speech recognition task is to find the optimal word sequence $W = w_1, w_2, \ldots, w_L$ which maximizes the probability $P(W \mid y)$, i.e. the probability of the word sequence $W$ given the acoustic feature vectors $y$. This probability is normally computed using Bayes' theorem:

$$P(W \mid y) = \frac{P(y \mid W)\, P(W)}{P(y)}$$


In the equation above, $P(W)$ is the probability of the word sequence $W$, named the language model probability and computed using a language model. The probability $P(y \mid W)$ is referred to as the acoustic model probability, which is the probability that the acoustic feature vectors $y$ are produced by the word sequence $W$. $P(y \mid W)$ is calculated using the acoustic model.


Phoneme-level HMMs represent phoneme units instead of whole words. The pronunciation dictionary specifies the sequence of HMMs for each word in the vocabulary. For a given word sequence, the pronunciation dictionary finds and concatenates the corresponding phoneme HMMs for each word in the sequence. The probabilities $P(y \mid W)$ and $P(W)$ are then multiplied together by the Viterbi decoder for all possible word sequences allowed by the language model. The one with the maximum probability is selected as the optimal word sequence.
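
Putting the two model probabilities together, and noting that $P(y)$ does not depend on $W$ and so can be dropped from the search, the decoder's task can be summarized by the standard decision rule:

$$\hat{W} = \arg\max_W P(W \mid y) = \arg\max_W P(y \mid W)\, P(W)$$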



2.2 Segment HMMs


Although conventional HMM-based systems have achieved impressive performance on certain types of speech recognition tasks, there have been discussions of their inherent deficiencies when applied to speech. For example, HMMs assume independent observation vectors given the state sequence, ignoring the fact that the constraints of articulation are such that any one frame of speech is highly correlated with the previous and following frames (Holmes, 2000). Generally speaking, the piecewise stationarity assumption, the independence assumption and poor state duration modelling are three major limitations of HMMs for speech. These assumptions are made for mathematical tractability and are clearly inappropriate for speech.





[Figure omitted: HMM states 1-3, each associated with a variable-length group of acoustic feature vectors.]

Figure 2 Schematic diagram of a segment HMM


A variety of refined and alternative models have been investigated to overcome the limitations of conventional HMMs. In particular, a number of "segment HMMs" have been exploited, in which variable-length sequences of observation vectors, or 'segments', rather than individual observations, are associated with models, as shown in Figure 2. A major motivation for the development of segmental acoustic models was the opportunity to exploit acoustic features which are apparent at the segmental but not at the frame level (Holmes, Russell 1999). Examples of segment HMMs include Digalakis (1992), Russell (1993) and Gales and Young (1993). Ostendorf, Digalakis and Kimball (1996) presented a comprehensive review of a variety of segment HMMs.


Segment HMMs have several advantages over conventional HMMs. Firstly, the length of segments is variable, which allows a more realistic duration model to be easily incorporated. The state duration in a conventional HMM is modelled by a geometric distribution, which is unsuitable for speech segment duration modelling. Other duration models with explicit distributions of state occupancy have been studied by Russell (1985) and Levinson (1986). Secondly, segment HMMs allow the relationship between feature vectors within a segment to be represented explicitly, usually taking the form of trajectories in the acoustic vector space. There are many different types of trajectory-based segmental HMMs. Examples include the simplest trajectory form, in which the trajectory is constant over time (Russell 1993; Gales, Young 1993), the linear trajectory characterized by a mid-point and slope (Holmes, Russell 1999), and the non-parametric trajectory model (Ghitza and Sondhi 1993).



2.3 Multiple-level segment HMM (MSHMM)


It is generally believed that any incorporation of speech dynamics into acoustic models will eventually lead to improvements in speech recognition performance. What is interesting and of potential benefit is to model speech dynamics directly within the existing HMM framework, hence retaining the advantages of HMMs, for instance the well-studied training and decoding algorithms. A good example is the "Balthasar" project conducted at the University of Birmingham (Russell, Jackson 2005). The "Balthasar" project introduced a novel multiple-level segmental hidden Markov model (MSHMM), which incorporates an intermediate 'articulatory' layer to regulate the relationship between underlying states and surface acoustic feature vectors.


[Figure omitted: a finite state process (states 1-3) produces trajectories in an intermediate layer, which an articulatory-to-acoustic mapping projects onto the acoustic layer.]

Figure 3 A multiple-level segmental model that uses linear trajectories in an intermediate space.


Figure 3 shows the multiple-level segmental HMM (MSHMM) proposed in the 'Balthasar' project. It can be seen clearly that the relationship between the state-level and acoustic-level descriptions of a speech signal is regulated by an intermediate, 'articulatory' representation (Russell, Jackson 2004). To construct the 'articulatory' space, a sequence of Parallel Formant Synthesiser (PFS) control parameters is used.


A typical parallel formant synthesiser (Holmes, Mattingly and Shearme 1966) consists of a series of resonators arranged in parallel, usually four, representing the first four formant frequencies, with a corresponding amplitude control for each formant resonator plus a voicing control to determine the type of excitation. A sequence of 12-dimensional vectors, including the frequencies and amplitudes of the first three formants, the fundamental frequency and the degree of voicing, is sent every 10-20 ms to drive the synthesiser. It has been demonstrated that the derivation of appropriate control parameters determines the quality of the resulting synthesis.





The segmental model used in the MSHMM framework is a fixed, linear trajectory segmental HMM (FT-SHMM) (Holmes, Russell 1999), in which each acoustic segment is considered as a variable-duration, noisy function of a linear trajectory. The segment trajectory of duration $\tau$ is characterized by

$$f(t) = \left(t - \frac{\tau + 1}{2}\right) m + c,$$

where $m$ and $c$ are the slope vector and midpoint vector respectively. Taking $y_1^\tau = [y_1, y_2, \ldots, y_\tau]$ as the acoustic vectors associated with state $i$, the state output probability is given as follows:









$$b_i(y_1^\tau) = c_i(\tau) \prod_{t=1}^{\tau} \mathcal{N}\big(y_t \mid f_i(t), V_i\big),$$

where $c_i(\tau)$ is the duration PDF and $\mathcal{N}(y_t \mid f_i(t), V_i)$ is the multivariate Gaussian PDF with mean $f_i(t)$ and diagonal $N \times N$ covariance $V_i$.
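
As an illustration of this output probability, the following minimal Python sketch evaluates $\log b_i(y_1^\tau)$ for one segment. The Poisson duration PDF is purely an assumption made so the example runs; the report does not specify the form of $c_i(\tau)$.

```python
import numpy as np
from scipy.stats import poisson

def segment_log_prob(Y, m, c, var, mean_dur):
    """Log b_i(y_1^tau) for a fixed linear-trajectory segmental HMM state.

    Y: (tau, N) acoustic vectors; m, c: (N,) slope and midpoint vectors;
    var: (N,) diagonal Gaussian variances; mean_dur: duration PDF parameter.
    """
    tau, N = Y.shape
    t = np.arange(1, tau + 1)
    # Linear trajectory f(t) = (t - (tau + 1)/2) m + c, one row per frame.
    traj = (t - (tau + 1) / 2.0)[:, None] * m[None, :] + c[None, :]
    # Sum of diagonal-covariance Gaussian log densities around the trajectory.
    gauss = -0.5 * np.sum((Y - traj) ** 2 / var + np.log(2 * np.pi * var))
    # Duration term c_i(tau); a Poisson PMF stands in for the duration PDF.
    return poisson.logpmf(tau, mean_dur) + gauss
```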


The novelty of the MSHMM is to define the trajectory $f(t)$ in the intermediate or articulatory layer and then map it onto the acoustic layer by using an articulatory-to-acoustic mapping function $W$, i.e.

$$b_i(y_1^\tau) = c_i(\tau) \prod_{t=1}^{\tau} \mathcal{N}\big(y_t \mid W(f_i(t)), V_i\big).$$

Here the mapping function $W$ is piecewise linear. The linear mapping work has been carried out by Miss Zheng, and research on non-linear mapping has already started.


The principle of Viterbi decoding still applies to segmental HMMs, though the computational load is greater than for conventional HMMs. The segmental Viterbi recursion is shown below:

$$\hat{\varphi}_t(i) = \max_j \max_\tau \left[ \hat{\varphi}_{t-\tau}(j)\, a_{ji}\, b_i\big(y_{t-\tau+1}^t\big) \right],$$

where $b_i(y_{t-\tau+1}^t)$ is the segmental state output probability.
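
A minimal sketch of this recursion in log space is given below. The interface is hypothetical: `seg_log_prob(i, s, t)` stands for $\log b_i(y_s^t)$, a single entry state is assumed, and segment durations are capped at `max_dur` to bound the computation.

```python
import numpy as np

def segmental_viterbi(T, n_states, log_a, seg_log_prob, max_dur):
    """Best log score of any state/segmentation path over T frames.

    log_a: (n_states, n_states) log transition matrix, entry [j, i] = log a_ji;
    seg_log_prob(i, s, t): log b_i(y_s..y_t) for a segment spanning frames s..t.
    """
    phi = np.full((T + 1, n_states), -np.inf)
    phi[0, 0] = 0.0                          # assume entry in state 0 at time 0
    for t in range(1, T + 1):
        for i in range(n_states):
            for tau in range(1, min(max_dur, t) + 1):
                emit = seg_log_prob(i, t - tau + 1, t)
                # max over predecessor states j of phi[t - tau, j] + log a_ji
                best_prev = np.max(phi[t - tau] + log_a[:, i])
                phi[t, i] = max(phi[t, i], best_prev + emit)
    return phi[T].max()                      # best complete-path log score
```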


Although the multiple-level segmental HMM is at a very early stage, it provides a solid theoretical foundation for the development of richer classes of multi-level models, including non-linear models of dynamics and alternative articulatory representations (Russell, Jackson 2005). The incorporation of an intermediate layer has many advantages: it introduces the possibility of better speech dynamics modelling and better articulatory-to-acoustic mapping.






2.4 Trajectory HMM


Recently, a new type of model, the "trajectory HMM", was proposed by Tokuda, Zen and Kitamura (2003). The trajectory HMM is based on conventional HMMs whose acoustic feature vectors include static and dynamic features. It is intended to overcome the limitations of constant statistics within an HMM state and the independence assumption of state output probabilities. The conventional HMM allows inconsistent static and dynamic features, as shown in Figure 4. The piecewise constant thick line at the top stands for the static feature, or state mean, of each state. The thin line at the bottom stands for the dynamic feature, for example delta coefficients. The trajectory HMM for speech recognition was derived by reformulating the conventional HMM with static and dynamic observation vectors (e.g. delta and delta-delta cepstral coefficients), whereby an explicit relationship is imposed between static and dynamic features.



Figure 4 Inconsistency between static and dynamic features


From the speech production point of view, the vocal organs move smoothly when people articulate, because of physiological limitations. Intuitively, the state mean trajectory should be a continuous line, as shown in Figure 4, instead of a piecewise constant sequence. To mitigate the inconsistency between piecewise constant static features and the dynamic features, it is hoped that the dynamic features can be incorporated into the static features to form an integrated optimal speech parameter trajectory.


To derive the trajectory HMM, the first step is to map the static feature space to the output acoustic feature space, which contains both static and dynamic features, for example delta and delta-delta spectral coefficients. To begin with, assume an acoustic vector sequence $y = [y_1^\top, y_2^\top, \ldots, y_T^\top]^\top$ contains both static features $c$ and dynamic features $\Delta c$, $\Delta^2 c$, i.e. the $t$-th acoustic vector is $y_t = [c_t^\top, \Delta c_t^\top, \Delta^2 c_t^\top]^\top$, where $\top$ denotes matrix transposition. The goal is to find a mapping function between the static and dynamic feature vectors. Assume the static feature vector is $D$-dimensional, i.e. $c_t = [c_t(1), c_t(2), \ldots, c_t(D)]^\top$. The dynamic features are calculated by




$$\Delta c_t = \sum_{\tau = -L^{(1)}}^{L^{(1)}} w^{(1)}(\tau)\, c_{t+\tau} \qquad (1)$$

$$\Delta^2 c_t = \sum_{\tau = -L^{(2)}}^{L^{(2)}} w^{(2)}(\tau)\, c_{t+\tau} \qquad (2)$$

Now the static features and dynamic features can be arranged in matrix form as follows:

$$y = Wc \qquad (3)$$

where

$$c = [c_1^\top, c_2^\top, \ldots, c_T^\top]^\top, \qquad W = [W_1, W_2, \ldots, W_T]^\top, \qquad W_t = [w_t^{(0)}, w_t^{(1)}, w_t^{(2)}]^\top,$$

$$w_t^{(n)} = [\underbrace{0_{D \times D}, \ldots, 0_{D \times D}}_{1\text{st to }(t - L^{(n)} - 1)\text{-th}},\; w^{(n)}(-L^{(n)}) I_{D \times D}, \ldots, w^{(n)}(0) I_{D \times D}, \ldots, w^{(n)}(L^{(n)}) I_{D \times D},\; \underbrace{0_{D \times D}, \ldots, 0_{D \times D}}_{(t + L^{(n)} + 1)\text{-th to }T\text{-th}}]^\top, \quad n = 0, 1, 2,$$

and $L^{(0)} = 0$, $w^{(0)}(0) = 1$.
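
As a sanity check on the construction of $W$, the following sketch builds a small dense version of it. The regression windows $w^{(1)} = (-0.5, 0, 0.5)$ and $w^{(2)} = (1, -2, 1)$ and the zero padding at the sequence edges are assumptions for illustration; the report does not fix these choices.

```python
import numpy as np

def build_W(T, D, w1=(-0.5, 0.0, 0.5), w2=(1.0, -2.0, 1.0)):
    """Build the (3*D*T, D*T) matrix W so that y = W c stacks
    [c_t, delta c_t, delta^2 c_t] for every frame t (edges zero-padded)."""
    windows = [(0, [1.0]), (1, list(w1)), (1, list(w2))]  # (L^(n), weights)
    W = np.zeros((3 * D * T, D * T))
    I = np.eye(D)
    for t in range(T):
        for n, (L, w) in enumerate(windows):
            row = (3 * t + n) * D          # block row for stream n of frame t
            for k, tau in enumerate(range(-L, L + 1)):
                if 0 <= t + tau < T:
                    W[row:row + D, (t + tau) * D:(t + tau + 1) * D] = w[k] * I
    return W

# Usage: stack the static vectors into one column and multiply.
T, D = 5, 2
c = np.random.randn(T * D)
y = build_W(T, D) @ c   # y interleaves static, delta and delta-delta blocks
```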


The state output probability for the conventional HMM is

$$P(y \mid M) = \sum_x P(y \mid x, M)\, P(x \mid M) \qquad (4)$$

where $x = x_1, x_2, \ldots, x_T$ is a state sequence.

The next goal is to transform (4) into a form which contains static features only. Assuming each state output PDF is a single Gaussian, the probability $P(y \mid x, M)$ can then be written as

$$P(y \mid x, M) = \prod_{t=1}^{T} \mathcal{N}\big(y_t \mid \mu_{x_t}, U_{x_t}\big) \qquad (5)$$

where $\mu_{x_t}$ and $U_{x_t}$ are the $3M \times 1$ mean vector and the $3M \times 3M$ covariance matrix of state $x_t$ respectively, and

$$\mu_x = [\mu_{x_1}^\top, \mu_{x_2}^\top, \ldots, \mu_{x_T}^\top]^\top, \qquad U_x = \mathrm{diag}[U_{x_1}, U_{x_2}, \ldots, U_{x_T}].$$




Now rewrite (5) by substituting (3) into (5):

$$P(Wc \mid x, M) = \mathcal{N}\big(Wc \mid \mu_x, U_x\big) = K_x\, \mathcal{N}\big(c \mid \bar{c}_x, Q_x\big) \qquad (6)$$

where $\bar{c}_x$ is given by

$$R_x\, \bar{c}_x = r_x$$






and

$$R_x = W^\top U_x^{-1} W, \qquad r_x = W^\top U_x^{-1} \mu_x, \qquad Q_x = R_x^{-1},$$

$$K_x = \frac{(2\pi)^{DT/2}\, |Q_x|^{1/2}}{(2\pi)^{3DT/2}\, |U_x|^{1/2}} \exp\left\{ -\frac{1}{2} \left( \mu_x^\top U_x^{-1} \mu_x - r_x^\top Q_x\, r_x \right) \right\},$$

where $K_x$ is the normalization constant.


Finally, the trajectory model is defined by eliminating the normalization constant $K_x$ in (6):

$$P(c \mid M) = \sum_x P(c \mid x, M)\, P(x \mid M)$$

where

$$P(c \mid x, M) = \mathcal{N}\big(c \mid \bar{c}_x, Q_x\big).$$

It has been demonstrated that the mean $\bar{c}_x$ obtained this way is exactly the speech parameter trajectory of Tokuda et al (2000), i.e.

$$\bar{c}_x = \arg\max_c P(y \mid x, M) = \arg\max_c P(Wc \mid x, M).$$
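
Computationally, obtaining $\bar{c}_x$ amounts to a single linear solve. The dense sketch below (reusing the shape conventions of the illustrative `build_W` example earlier; a practical implementation would exploit the band structure of $R_x$) makes this explicit:

```python
import numpy as np

def mean_trajectory(W, mu_x, U_x_diag):
    """Solve R_x cbar = r_x for the optimal static parameter trajectory.

    W: (3*D*T, D*T) window matrix; mu_x: (3*D*T,) stacked state means;
    U_x_diag: (3*D*T,) diagonal of the stacked state covariances.
    """
    U_inv = 1.0 / U_x_diag                  # diagonal inverse covariance
    R = W.T @ (U_inv[:, None] * W)          # R_x = W^T U_x^{-1} W
    r = W.T @ (U_inv * mu_x)                # r_x = W^T U_x^{-1} mu_x
    return np.linalg.solve(R, r)            # cbar_x = R_x^{-1} r_x
```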
 


3 Project Aims and the Initial Stages of Work


The ultimate goal of this project is to develop more appropriate, sophisticated statistical methods to model speech dynamics. The framework chosen in this project is the multiple-level segment HMM (MSHMM). It is hoped that the development of these models will be able to support both high-accuracy speech and speaker recognition, and high-accuracy speech synthesis.



Based on the research of the last three months, the research question proposed here is: what is the best way to model speech dynamics within the framework of multiple-layer segmental HMMs? As has been discussed before, the trajectory HMM defines a mapping function between static and dynamic features, which is analogous to the articulatory-to-acoustic mapping in the MSHMM. It is anticipated that a positive outcome may result if the trajectory HMM is incorporated into the MSHMM framework, i.e. modelling dynamics in the intermediate layer using a trajectory HMM.




In the "Balthasar" project, speech dynamics were modelled as piecewise linear trajectories in the articulatory space and linearly mapped into the acoustic space. A piecewise linear model provides an adequate 'passive' approximation to the formant trajectories, but does not capture the active dynamics of the articulatory system (Russell 2004). One of the objectives of this project, and possibly an initial stage of it, is to conduct research on non-linear articulatory trajectory modelling and the incorporation of these trajectories into multiple-layer segment HMMs. The improved model will be tested and evaluated with extensive experiments on speech and speaker recognition and speech synthesis.


4 Plans for the Next Six Months


This section gives a plan and timetable for the next six months. A Gantt chart showing an approximate schedule is presented below. Basically, there are three main objectives for the next six months. The first objective is to become familiar with HTK. A specific task is to use HTK to build a set of monophone or triphone HMMs based on PFS control parameters. The expected duration for completion of this objective is two months. The second objective is to write a programme to synthesize speech from a set of HMMs. It may take about one month to finish this job. Finally, half of the next six months is given to the implementation of the trajectory HMM algorithm, as the trajectory HMM is highly related to the research question proposed above. Apart from the above objectives, background reading will continue throughout the whole period.







References


Digalakis, V. "Segment-based stochastic models of spectral dynamics for continuous speech recognition", PhD Thesis, Boston University, 1992.

Gales, M.J.F. and Young, S.J. "Segmental hidden Markov models", Proceedings of Eurospeech 93, Berlin, Germany, pp. 1579-1582, 1993.

Ghitza, O. and Sondhi, M. "Hidden Markov models with templates as non-stationary states: an application to speech recognition", Computer Speech and Language 2, 101-119, 1993.

Gish, H. and Ng, K. "A segmental speech model with applications to word spotting", Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Minneapolis, pp. 447-450, 1993.

Holmes, J.N. and Holmes, W. "Speech Synthesis and Recognition", 2nd edition, Taylor & Francis, 2001.

Holmes, W.J. and Russell, M.J. "Probabilistic-trajectory segmental HMMs", Computer Speech and Language 13, 3-37, 1999.

Holmes, W.J. "Segmental HMMs: modelling dynamics and underlying structure for automatic speech recognition", IMA Talks: Mathematical Foundations of Speech Processing and Recognition, September 18-22, 2000.

Holmes, J.N., Mattingly, I.G. and Shearme, J.N. "Speech synthesis by rule", Language and Speech 7, pp. 127-143, 1966.

Levinson, S. "Continuously variable duration hidden Markov models for automatic speech recognition", Computer Speech and Language, vol. 1, pp. 29-45, 1986.

Ostendorf, M., Digalakis, V. and Kimball, O.A. "From HMM's to segment models: a unified view of stochastic modelling for speech recognition", IEEE Transactions on Speech and Audio Processing 4, 360-378, 1996.

Russell, M.J. and Jackson, P.J.B. "A multiple-level linear/linear segmental HMM with a formant-based intermediate layer", Computer Speech and Language 19(2), 205-225, 2005.

Russell, M.J. "A unified model for speech recognition and synthesis", the University of Birmingham, 2004.

Russell, M.J. and Moore, R. "Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition", Proc. Int. Conf. Acoustics, Speech and Signal Processing, pp. 2376-2379, 1985.

Russell, M.J. "A segmental HMM for speech pattern modelling", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Minneapolis, MN, pp. 499-502, 1993.

Richards, H.B. and Bridle, J.S. "The HDM: a segmental Hidden Dynamic Model of coarticulation", Proceedings of IEEE ICASSP, Phoenix, AZ, pp. 357-360, 1999.

Tokuda, K., Zen, H. and Kitamura, T. "Trajectory modelling based on HMMs with the explicit relationship between static and dynamic features", Proc. of Eurospeech 2003, pp. 865-868, 2003.

Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T. and Kitamura, T. "Speech parameter generation algorithms for HMM-based speech synthesis", Proc. ICASSP, vol. 3, pp. 1315-1318, 2000.

Young, S.J. "Large vocabulary continuous speech recognition: a review", Proceedings of the IEEE Workshop on Automatic Speech Recognition, Snowbird, pp. 3-28, 1995.