PhD Three Month Report
1 Introduction
Automatic Speech Recognition (ASR) is mainly concerned with the automatic conversion of an acoustic speech signal into a text transcription of that speech. Many approaches to automatic speech recognition have been studied in the past, including approaches based on Artificial Intelligence (AI), pattern template matching techniques and Artificial Neural Networks (ANNs). Currently, the use of Hidden Markov Models (HMMs) is the most popular statistical approach to automatic speech recognition. At present, all state-of-the-art commercial and most laboratory speech recognition systems are based on the HMM approach.
Generally speaking, the goal of the first three months was to gain a comprehensive perspective on HMM-based speech recognition techniques, from conventional HMMs (Young 1995) to segmental HMMs (Holmes, Russell 1999) and to the newly developed trajectory HMMs (Tokuda et al 2000). Moreover, some areas highly relevant to speech recognition, for example machine learning, have also been explored during the past three months. As far as this report is concerned, the aim is to provide a summary of the background reading and literature survey done so far. More importantly, an intended research question is proposed based on what has been reviewed on speech recognition. Finally, a plan and timetable for the next six months are presented at the end of the report.
2 Literature Review
2.1 HMM-based speech recognition system overview
The performance of automatic speech recognition systems has been greatly advanced as a result of the employment of Hidden Markov Models (HMMs) since the 1980s. HMMs treat an utterance as a time-varying sequence of segments. All variations between different realizations of the segments, caused by different speaking styles, different speakers etc., are normally modelled using multiple-component Gaussian mixtures. Figure 1 below shows a typical phoneme-level HMM-based large vocabulary continuous speech recognition system. It can be seen that the main components of a speech recognition system include front-end processing, the acoustic model, the language model and the Viterbi decoder.
Figure 1 Schematic diagram of a phoneme-level HMM-based speech recognition system. (The diagram shows front-end processing converting the speech waveform into an acoustic vector sequence Y = {y_1, y_2, ..., y_T}; the Viterbi decoder combines the language model probability P(W), the acoustic model probability P(Y|W) from a phoneme-level HMM store, and the pronunciation dictionary to produce the optimal word sequence w_1, w_2, ..., w_N.)
Any speech recognition system begins with the automatic transformation of the speech waveform into a phonetically meaningful, compact representation which is appropriate for recognition. This initial stage of speech recognition is referred to as front-end processing. For instance, the speech is first divided into frames, usually 20-30ms long with a 10ms frame shift, so that adjacent frames overlap. Each short speech frame is windowed and a Discrete Fourier Transform (DFT) is then applied. The modulus and logarithm of the resultant complex spectral coefficients are taken to produce the log power spectrum. Mel-frequency averaging is then applied, followed by a Discrete Cosine Transform (DCT), to produce a set of Mel Frequency Cepstral Coefficients (MFCCs).
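The DFT, log power spectrum, mel averaging and DCT steps described above can be sketched numerically as follows. This is a minimal illustration only; the frame length, filterbank size and number of coefficients are illustrative choices, not the specific front-end configuration used in this project:

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_mel=24, n_ceps=12):
    """Compute MFCCs for one speech frame (illustrative sketch)."""
    # Window the frame, then take the DFT
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed)
    # Modulus and logarithm give the log power spectrum
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-10)
    # Mel-frequency averaging with a simplified triangular filterbank
    freqs = np.linspace(0, sample_rate / 2, len(log_power))
    mel = 2595.0 * np.log10(1.0 + freqs / 700.0)
    edges = np.linspace(mel.min(), mel.max(), n_mel + 2)
    fbank = np.zeros(n_mel)
    for m in range(n_mel):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        weights = np.maximum(0.0, np.minimum((mel - lo) / (mid - lo),
                                             (hi - mel) / (hi - mid)))
        fbank[m] = np.sum(weights * log_power)
    # DCT of the filterbank outputs yields the cepstral coefficients
    n = np.arange(n_mel)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), n + 0.5) / n_mel)
    return dct @ fbank
```

In practice the filterbank edges, windowing and liftering details differ between toolkits; this sketch only shows the order of operations described in the text.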
Alternative front-end analyses are also used. Examples include linear prediction and Perceptual Linear Prediction (PLP).
MFCCs are widely used as the acoustic feature vectors in speech recognition. However, they do not account for the underlying speech dynamics which cause speech changes. A simple way to capture speech dynamics is to calculate the first and sometimes the second time derivatives of the static feature vectors and concatenate the static and dynamic feature vectors together. Although the use of dynamic features improves recognition performance, the independence assumption actually becomes even less valid, because the observed data for any one frame contribute to a time span of several feature vectors (Holmes, Russell 1999).
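The concatenation of static and dynamic features can be sketched with the standard regression formula for delta and delta-delta coefficients. The window half-width L and edge padding below are illustrative assumptions, not the project's exact settings:

```python
import numpy as np

def add_deltas(static, L=2):
    """Append first (delta) and second (delta-delta) time derivatives
    to a sequence of static feature vectors: shape (T, D) -> (T, 3*D).
    Uses the usual regression formula over a +/-L frame window."""
    T, D = static.shape
    denom = 2 * sum(l * l for l in range(1, L + 1))
    padded = np.pad(static, ((L, L), (0, 0)), mode="edge")
    delta = np.zeros_like(static)
    for l in range(1, L + 1):
        delta += l * (padded[L + l:L + l + T] - padded[L - l:L - l + T])
    delta /= denom
    # Delta-delta: apply the same regression to the deltas
    padded_d = np.pad(delta, ((L, L), (0, 0)), mode="edge")
    delta2 = np.zeros_like(delta)
    for l in range(1, L + 1):
        delta2 += l * (padded_d[L + l:L + l + T] - padded_d[L - l:L - l + T])
    delta2 /= denom
    return np.concatenate([static, delta, delta2], axis=1)
```

A feature sequence that is constant over time yields zero deltas, as expected of a derivative estimate.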
Given a sequence of acoustic vectors y = (y_1, y_2, ..., y_T), the speech recognition task is to find the optimal word sequence W = (w_1, w_2, ..., w_L) which maximizes the probability P(W|y), i.e. the probability of the word sequence W given the acoustic feature vectors y. This probability is normally computed using Bayes' theorem:

    P(W|y) = P(y|W) P(W) / P(y)
In the equation above, P(W) is the probability of the word sequence W, named the language model probability and computed using a language model. The probability P(y|W) is referred to as the acoustic model probability, which is the probability that the acoustic feature vectors y are produced by the word sequence W. P(y|W) is calculated using the acoustic model.
Phoneme-level HMMs represent phoneme units instead of whole words. The pronunciation dictionary specifies the sequence of HMMs for each word in the vocabulary. For a given word sequence, the pronunciation dictionary finds and concatenates the corresponding phoneme HMMs for each word in the sequence. The probabilities P(y|W) and P(W) are then combined by the Viterbi decoder for all possible word sequences allowed by the language model. The one with the maximum probability is selected as the optimal word sequence.
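The decoder's search can be illustrated, at the level of a single HMM in the log domain, by the standard Viterbi recursion. This is a minimal sketch; a real LVCSR decoder also interleaves the language model and pronunciation dictionary into the search:

```python
import numpy as np

def viterbi(log_a, log_b, log_pi):
    """Viterbi decoding for an HMM (illustrative sketch).
    log_a:  (N, N) log transition probabilities
    log_b:  (T, N) log output probabilities b_i(y_t) per frame
    log_pi: (N,)   log initial state probabilities
    Returns the most likely state sequence and its log probability."""
    T, N = log_b.shape
    delta = log_pi + log_b[0]          # best log score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_a        # scores[i, j]: from i to j
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(N)] + log_b[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):              # trace back the best path
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(np.max(delta))
```

The maximization is done in the log domain to avoid numerical underflow over long utterances.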
2.2 Segment HMMs
Although conventional HMM-based systems have achieved impressive performance on certain types of speech recognition task, there has been discussion of their inherent deficiencies when applied to speech. For example, HMMs assume independent observation vectors given the state sequence, ignoring the fact that the constraints of articulation are such that any one frame of speech is highly correlated with the previous and following frames (Holmes, 2000). Generally speaking, the piecewise stationarity assumption, the independence assumption and poor state duration modelling are three major limitations of HMMs for speech. These assumptions are made for mathematical tractability and are clearly inappropriate for speech.
Figure 2 Schematic diagram of a segment HMM. (The diagram shows three HMM states, 1-3, each associated with a variable-length sequence of acoustic feature vectors.)
A variety of refined and alternative models have been investigated to overcome the limitations of conventional HMMs. In particular, a number of "segment HMMs" have been exploited, in which variable-length sequences of observation vectors, or "segments", rather than individual observations are associated with models, as shown in Figure 2. A major motivation for the development of "segmental" acoustic models was the opportunity to exploit acoustic features which are apparent at the segmental but not at the frame level (Holmes, Russell 1999). Examples of segment HMMs include Digalakis (1992), Russell (1993) and Gales and Young (1993). Ostendorf, Digalakis and Kimball (1996) presented a comprehensive review of a variety of segment HMMs.
Segment HMMs have several advantages over conventional HMMs. Firstly, the length of a segment is variable, which allows a more realistic duration model to be incorporated easily. The state duration in a conventional HMM is modelled by a geometric distribution, which is unsuitable for modelling speech segment durations. Duration models with explicit distributions of state occupancy have been studied by Russell (1985) and Levinson (1986). Secondly, segment HMMs allow the relationship between feature vectors within a segment to be represented explicitly, usually in the form of trajectories in the acoustic vector space. There are many different types of trajectory-based segmental HMM. Examples include the simplest trajectory form, in which the trajectory is constant over time (Russell 1993; Gales, Young 1993), the linear trajectory characterized by a mid-point and slope (Holmes, Russell 1999), and the non-parametric trajectory model (Ghitza and Sondhi 1993).
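The contrast between the geometric duration distribution implied by a conventional HMM self-loop and an explicit duration model can be made concrete as follows. The discretised gamma form and its parameters are illustrative assumptions only, not the specific models studied by Russell (1985) or Levinson (1986):

```python
import numpy as np
from math import lgamma, log

def geometric_log_dur(tau, a_ii):
    """Log duration probability implied by a conventional HMM self-loop:
    P(tau) = a_ii^(tau-1) * (1 - a_ii), a geometric distribution whose
    mode is always tau = 1."""
    return (tau - 1) * np.log(a_ii) + np.log1p(-a_ii)

def gamma_log_dur(tau, shape=4.0, scale=2.0):
    """An explicit duration model: unnormalised log density of a
    discretised gamma distribution (illustrative shape/scale values),
    which can peak at a realistic segment length instead of tau = 1."""
    return ((shape - 1) * log(tau) - tau / scale
            - shape * log(scale) - lgamma(shape))
```

The geometric distribution is monotonically decreasing, so it always favours the shortest possible stay in a state, whereas the explicit model places its mode at a plausible duration.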
2.3
Multiple

level segment HMM (MSHMM)
It is generally believed that any type of incorporation of sp
eech dynamics into acoustic models will
eventually lead to improvements
in speech recognition performance.
What is
interesting and of
PhD Three Month Report
5
potential benefits
is to
model speech dynamics directly
within the existing HMMs framework and
hence retaining
the
advanta
ges of HMMs, for instance the well

studied training and decoding
algorithm. A good example is the “Balthasar” project conducted at the University of Birmingham
(Russell, Jackson 2005). The “Bathasar” project introduced a novel multiple

level segmental hid
den
Markov model (MSHMM)
, which
incorporates
an
intermediate ‘articulatory’ layer to regulate the
relationship between underlying states and surface acoustic feature vectors.
Figure 3 A multiple-level segmental model that uses linear trajectories in an intermediate space. (The diagram shows a finite state process with states 1, 2, 3 for a word w, an intermediate layer of linear trajectories, and an articulatory-to-acoustic mapping onto the acoustic layer.)
Figure 3 shows the multiple-level segmental HMM (MSHMM) proposed in the "Balthasar" project. It can be seen clearly that the relationship between the state-level and acoustic-level descriptions of a speech signal is regulated by an intermediate, 'articulatory' representation (Russell, Jackson 2004). To construct the 'articulatory' space, a sequence of Parallel Formant Synthesiser (PFS) control parameters is used.
A typical parallel formant synthesiser (Holmes, Mattingly and Shearme 1966) consists of a series of resonators arranged in parallel, usually four, representing the first four formant frequencies, with a corresponding amplitude control for each formant resonator plus a voicing control to determine the type of excitation. A sequence of 12-dimensional vectors, including the frequencies and amplitudes of the first three formants, the fundamental frequency and the degree of voicing, is sent every 10-20ms to drive the synthesiser. It has been demonstrated that the derivation of appropriate control parameters largely determines the quality of the synthesis.
The segmental model used in the MSHMM framework is a fixed, linear trajectory segmental HMM (FT-SHMM) (Holmes, Russell 1999), in which each acoustic segment is treated as a variable-duration, noisy function of a linear trajectory. The segment trajectory of duration τ is characterized by

    f(t) = (t − (τ + 1)/2) m + c,    1 ≤ t ≤ τ,

where m and c are the slope vector and midpoint vector respectively. Assuming y = [y_1, y_2, ..., y_τ] are the acoustic vectors associated with state i, the state output probability is given as follows:

    b_i(y) = c_i(τ) ∏_{t=1}^{τ} N(y_t; f_i(t), V_i),

where c_i(τ) is the duration PDF and N(y_t; f_i(t), V_i) is the multivariate Gaussian PDF with mean f_i(t) and diagonal N×N covariance V_i.
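A minimal numerical sketch of this segment output probability, computed in the log domain, is given below. The parameter values and the duration PDF (supplied by the caller) are placeholders, not the project's trained models:

```python
import numpy as np

def ftshmm_segment_loglik(Y, m, c, var, log_dur_pdf):
    """Log output probability of one FT-SHMM segment (illustrative sketch).
    Y:   (tau, D) acoustic vectors assigned to the state
    m,c: (D,) slope and midpoint vectors of the linear trajectory
    var: (D,) diagonal Gaussian variances
    log_dur_pdf: function tau -> log duration probability c_i(tau)"""
    tau, D = Y.shape
    t = np.arange(1, tau + 1)
    # Linear trajectory f(t) = (t - (tau+1)/2) m + c, centred on the midpoint
    F = (t - (tau + 1) / 2)[:, None] * m[None, :] + c[None, :]
    # Sum of diagonal-Gaussian log densities over the segment frames
    ll = -0.5 * np.sum((Y - F) ** 2 / var + np.log(2 * np.pi * var))
    return log_dur_pdf(tau) + ll
```

When the observations lie exactly on the trajectory, only the Gaussian normalization terms and the duration term remain.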
The novelty of the MSHMM is to define the trajectory f(t) in the intermediate or articulatory layer and then map it onto the acoustic layer using an 'articulatory-to-acoustic' mapping function W, i.e.

    b_i(y) = c_i(τ) ∏_{t=1}^{τ} N(y_t; W(f_i(t)), V_i).

Here the mapping function W is piecewise linear. The linear mapping work has been carried out by Miss Zheng, and research on non-linear mapping has already started.
The principle of Viterbi decoding still applies to segmental HMMs, though the computational load is greater than for conventional HMMs. The segmental Viterbi recursion is shown below:

    δ_t(i) = max_j max_τ δ_{t−τ}(j) a_ji b_i(y_{t−τ+1}, ..., y_t),

where b_i(y_{t−τ+1}, ..., y_t) is the segmental state output probability.
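The recursion above can be sketched directly. Here seg_logp stands in for the segmental log output probability b_i, and max_dur bounds the inner search over the segment length τ; both are assumptions of this sketch rather than details fixed by the model:

```python
import numpy as np

def segmental_viterbi_score(log_pi, log_a, seg_logp, T, max_dur):
    """Best-path log score under the segmental Viterbi recursion
    delta_t(i) = max_j max_tau [delta_{t-tau}(j) + log a_ji] + log b_i(seg).
    seg_logp(i, s, t): segmental log output probability of state i
    generating frames s..t inclusive (0-based)."""
    N = len(log_pi)
    delta = np.full((T + 1, N), -np.inf)
    for t in range(1, T + 1):
        for i in range(N):
            for tau in range(1, min(max_dur, t) + 1):
                if t == tau:          # first segment of the utterance
                    enter = log_pi[i]
                else:                 # best predecessor state j
                    enter = np.max(delta[t - tau] + log_a[:, i])
                delta[t, i] = max(delta[t, i],
                                  enter + seg_logp(i, t - tau, t - 1))
    return float(np.max(delta[T]))
```

The extra maximization over τ is what makes segmental decoding more expensive than the frame-level recursion.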
Although the multiple-layer segmental HMM is at a very early stage, it provides a solid theoretical foundation for the development of richer classes of multi-level models, including non-linear models of dynamics and alternative articulatory representations (Russell, Jackson 2005). The incorporation of an intermediate layer has many advantages: it introduces the possibility of better modelling of both speech dynamics and the articulatory-to-acoustic mapping.
2.4 Trajectory HMM
Recently, a new type of model, the "Trajectory HMM", was proposed by Tokuda, Zen and Kitamura (2003). The trajectory HMM is based on conventional HMMs whose acoustic feature vectors include static and dynamic features. It intends to overcome the limitations of constant statistics within an HMM state and the independence assumption of state output probabilities. The conventional HMM allows inconsistent static and dynamic features, as shown in Figure 4. The piecewise constant thick line at the top stands for the static feature, or state mean of each state. The thin line at the bottom stands for the dynamic feature, for example the delta coefficients. The trajectory HMM for speech recognition was derived by reformulating the conventional HMM with static and dynamic observation vectors (e.g. delta and delta-delta cepstral coefficients), whereby an explicit relationship is imposed between static and dynamic features.
Figure 4 Inconsistency between static and dynamic features. (The diagram shows, over frames t−1, t, t+1 and t+2, a piecewise constant static feature sequence at the top and the corresponding dynamic feature at the bottom.)
From the speech production point of view, the vocal organs move smoothly when people articulate, because of physiological limitations. Intuitively, the state mean trajectory should be a continuous line, as shown in Figure 4, instead of a piecewise constant sequence. To mitigate the inconsistency between piecewise constant static and dynamic features, it is hoped that the dynamic features can be incorporated into the static features to form an integrated optimal speech parameter trajectory.
To derive the trajectory HMM, the first step is to map the static feature space to the output acoustic feature space, which contains both static and dynamic features, for example delta and delta-delta spectral coefficients. To begin with, assume an acoustic vector sequence y = [y_1ᵀ, y_2ᵀ, ..., y_Tᵀ]ᵀ contains both static features c_t and dynamic features Δc_t, Δ²c_t, i.e. the t-th acoustic vector is y_t = [c_tᵀ, Δc_tᵀ, Δ²c_tᵀ]ᵀ, where ᵀ denotes matrix transposition. The goal is to find a mapping between the static and dynamic feature vectors. Assume the static feature vector is D-dimensional, i.e. c_t = [c_t(1), c_t(2), ..., c_t(D)]ᵀ. The dynamic features are calculated by

    Δc_t  = Σ_{τ=−L1}^{L1} w1(τ) c_{t+τ}        (1)

    Δ²c_t = Σ_{τ=−L2}^{L2} w2(τ) c_{t+τ}        (2)
Now the static and dynamic features can be arranged in matrix form as follows:

    y = W c        (3)

where

    c = [c_1ᵀ, c_2ᵀ, ..., c_Tᵀ]ᵀ,
    W = [W_1, W_2, ..., W_T]ᵀ,
    W_t = [w_t(0), w_t(1), w_t(2)]ᵀ,
    w_t(n) = [0_{D×D}, ..., 0_{D×D}, wn(−Ln) I_{D×D}, ..., wn(0) I_{D×D}, ..., wn(Ln) I_{D×D}, 0_{D×D}, ..., 0_{D×D}],  n = 0, 1, 2,

with the non-zero blocks of w_t(n) occupying block columns t−Ln to t+Ln, and L0 = 0, w0(0) = 1.
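Equation (3) can be made concrete by constructing W explicitly. The first-difference windows used below (L = 1, with clamping at the sequence edges) are illustrative choices rather than the exact windows of equations (1) and (2):

```python
import numpy as np

def build_W(T, D, w1=(-0.5, 0.0, 0.5), w2=(1.0, -2.0, 1.0)):
    """Build the (3*T*D, T*D) window matrix W of equation (3), y = W c,
    stacking static, delta and delta-delta rows per frame (sketch)."""
    I = np.eye(D)
    W = np.zeros((3 * T * D, T * D))
    for t in range(T):
        # n = 0: static (identity window); n = 1, 2: delta and delta-delta
        for n, win in enumerate([(1.0,), w1, w2]):
            L = len(win) // 2
            row = (3 * t + n) * D
            for k, coef in enumerate(win):
                s = min(max(t + k - L, 0), T - 1)  # clamp at the edges
                W[row:row + D, s * D:(s + 1) * D] += coef * I
    return W
```

With this W, the static rows of y = W c reproduce c itself, and the delta rows form the symmetric differences of neighbouring frames.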
The state output probability for the conventional HMM is

    P(y | M) = Σ_x P(y | x, M) P(x | M)        (4)

where x = (x_1, x_2, ..., x_T) is a state sequence.
The next goal is to transform (4) into a form which contains static features only. Assuming each state output PDF is a single Gaussian, the probability P(y | x, M) can be written as

    P(y | x, M) = ∏_{t=1}^{T} N(y_t; μ_{x_t}, U_{x_t})        (5)

where μ_{x_t} and U_{x_t} are the 3D×1 mean vector and the 3D×3D covariance matrix of state x_t respectively, and

    μ_x = [μ_{x_1}ᵀ, μ_{x_2}ᵀ, ..., μ_{x_T}ᵀ]ᵀ,
    U_x = diag[U_{x_1}, U_{x_2}, ..., U_{x_T}].
Now rewrite (5) by substituting (3) into (5):

    P(Wc | x, M) = N(Wc; μ_x, U_x) = K_x N(c; c̄_x, Q_x)        (6)

where c̄_x is given by

    R_x c̄_x = r_x

and

    R_x = Wᵀ U_x⁻¹ W = Q_x⁻¹,
    r_x = Wᵀ U_x⁻¹ μ_x,
    K_x = [(2π)^{DT} |Q_x|]^{1/2} / [(2π)^{3DT} |U_x|]^{1/2} · exp{−(1/2)(μ_xᵀ U_x⁻¹ μ_x − r_xᵀ Q_x r_x)},

where K_x is the normalization constant.
Finally, the trajectory model is defined by eliminating the normalization constant K_x in (6):

    P(c | M) = Σ_x P(c | x, M) P(x | M)

where

    P(c | x, M) = N(c; c̄_x, Q_x).

It can be demonstrated that the mean c̄_x obtained this way is exactly the same as the speech parameter trajectory of Tokuda et al (2000), i.e.

    c̄_x = argmax_c P(y | x, M) = argmax_c P(Wc | x, M).
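The defining relation R_x c̄_x = r_x can be solved directly. This sketch assumes a diagonal stacked covariance U_x and uses a small dense solve; real implementations exploit the band structure of R_x induced by the windows in W:

```python
import numpy as np

def trajectory_mean(W, mu, u_diag):
    """Solve R_x cbar_x = r_x for the trajectory-HMM mean trajectory,
    with R_x = W^T U^-1 W and r_x = W^T U^-1 mu (illustrative sketch).
    W:      (3*T*D, T*D) window matrix, y = W c
    mu:     (3*T*D,) stacked state mean vectors
    u_diag: (3*T*D,) diagonal of the stacked covariance U_x"""
    U_inv = 1.0 / u_diag
    R = W.T @ (U_inv[:, None] * W)   # R_x = W^T U^-1 W
    r = W.T @ (U_inv * mu)           # r_x = W^T U^-1 mu
    return np.linalg.solve(R, r)     # cbar_x = R_x^-1 r_x
```

When the stacked means μ_x are themselves consistent, i.e. μ_x = W c for some static sequence c, the solution recovers that c exactly.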
3 Project Aims and the Initial Stages of Work
The ultimate goal of this project is to develop more appropriate, sophisticated statistical methods to model speech dynamics. The framework chosen in this project is the multiple-level segment HMM (MSHMM). It is hoped that the development of these models will be able to support both high-accuracy speech and speaker recognition, and high-accuracy speech synthesis.
Based on the research of the last three months, the research question proposed here is: what is the best way to model speech dynamics under the framework of multiple-layer segmental HMMs? As discussed above, the trajectory HMM defines a mapping between static and dynamic features, which is analogous to the articulatory-to-acoustic mapping in the MSHMM. It is expected that positive outcomes may result if the trajectory HMM is incorporated into the MSHMM framework, i.e. modelling dynamics in the intermediate layer using the trajectory HMM.
In the "Balthasar" project, speech dynamics were modelled as piecewise linear trajectories in the articulatory space and linearly mapped into the acoustic space. A piecewise linear model provides an adequate 'passive' approximation to the formant trajectories, but does not capture the active dynamics of the articulatory system (Russell 2004). One of the objectives of this project, and possibly an initial stage of it, is to conduct research on modelling non-linear articulatory trajectories and on incorporating these trajectories into multiple-layer segment HMMs. The improved model will be tested and evaluated with extensive experiments on speech and speaker recognition and on speech synthesis.
4 Plans for the Next Six Months
This section gives a plan and timetable for the next six months. A Gantt chart showing an approximate schedule is presented below. Basically, there are three main objectives for the next six months. The first objective is to become familiar with HTK. A specific task is to use HTK to build a set of monophone or triphone HMMs based on PFS control parameters. The expected duration for completion of this objective is two months. The second objective is to write a programme to synthesize speech from a set of HMMs. It may take about one month to finish this job. Finally, half of the next six months is given to the implementation of the trajectory HMM algorithm, as the trajectory HMM is highly related to the research question proposed above. Apart from these objectives, background reading will continue throughout the whole period.
References
Digalakis, V. "Segment-based stochastic models of spectral dynamics for continuous speech recognition", PhD Thesis, Boston University, 1992.
Gales, M.J.F. and Young, S.J. "Segmental hidden Markov models", Proceedings of Eurospeech '93, Berlin, Germany, pp. 1579-1582, 1993.
Ghitza, O. and Sondhi, M. "Hidden Markov models with templates as non-stationary states: an application to speech recognition", Computer Speech and Language 2, 101-119, 1993.
Gish, H. and Ng, K. "A segmental speech model with applications to word spotting", Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Minneapolis, pp. 447-450, 1993.
Holmes, J.N. and Holmes, W. "Speech Synthesis and Recognition", 2nd edition, Taylor & Francis, 2001.
Holmes, W.J. and Russell, M.J. "Probabilistic-trajectory segmental HMMs", Computer Speech and Language 13, 3-37, 1999.
Holmes, W.J. "Segmental HMMs: modelling dynamics and underlying structure for automatic speech recognition", IMA Talks: Mathematical Foundations of Speech Processing and Recognition, September 18-22, 2000.
Holmes, J.N., Mattingly, I.G. and Shearme, J.N. "Speech synthesis by rule", Language and Speech, 7, pp. 127-143, 1966.
Levinson, S. "Continuously variable duration hidden Markov models for automatic speech recognition", Computer Speech and Language, vol. 1, pp. 29-45, 1986.
Ostendorf, M., Digalakis, V. and Kimball, O.A. "From HMM's to segment models: a unified view of stochastic modelling for speech recognition", IEEE Transactions on Speech and Audio Processing 4, 360-378, 1996.
Russell, M.J. and Jackson, P.J.B. "A multiple-level linear/linear segmental HMM with a formant-based intermediate layer", Computer Speech and Language, 19(2): 205-225, 2005.
Russell, M.J. "A unified model for speech recognition and synthesis", the University of Birmingham, 2004.
Russell, M.J. and Moore, R. "Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition", Proc. Int. Conf. Acoustics, Speech and Signal Processing, pp. 2376-2379, 1985.
Russell, M.J. "A segmental HMM for speech pattern modelling", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Minneapolis, MN, pp. 499-502, 1993.
Richards, H.B. and Bridle, J.S. "The HDM: a segmental Hidden Dynamic Model of coarticulation", Proceedings of the IEEE ICASSP, Phoenix, AZ, pp. 357-360, 1999.
Tokuda, K., Zen, H. and Kitamura, T. "Trajectory modelling based on HMMs with the explicit relationship between static and dynamic features", in Proc. of Eurospeech 2003, pp. 865-868, 2003.
Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T. and Kitamura, T. "Speech parameter generation algorithms for HMM-based speech synthesis", Proc. ICASSP, vol. 3, pp. 1315-1318, 2000.
Young, S.J. "Large vocabulary continuous speech recognition: a review", Proceedings of IEEE Workshop on Automatic Speech Recognition, Snowbird, pp. 3-28, 1995.