Mathematical and Computer Modelling 52 (2010) 19101920
Contents lists available at
ScienceDirect
Mathematical and Computer Modelling
journal homepage:
www.elsevier.com/locate/mcm
A novel approach to HMM-based speech recognition systems using particle swarm optimization

Negin Najkar a,*, Farbod Razzazi a, Hossein Sameti b
a Department of Electrical Engineering, Faculty of Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran
b Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Article info
Article history:
Received 23 September 2009
Received in revised form 25 December 2009
Accepted 31 January 2010
Keywords:
Hidden Markov model (HMM)
Particle swarm optimization (PSO)
HMM-based speech recognition
Viterbi algorithm

Abstract
The core of HMM-based speech recognition systems is the Viterbi algorithm, which uses dynamic programming to find the best alignment between the input speech and a given speech model. In this paper, dynamic programming is replaced by a search method based on the particle swarm optimization algorithm. The main idea is to generate an initial population of segmentation vectors in the solution search space and to improve the location of the segments with an updating algorithm. Several methods are introduced and evaluated for the representation of particles and their corresponding movement structures. In addition, two segmentation strategies are explored. The first is the standard segmentation, which maximizes the likelihood function for each competing acoustic model separately. In the second, a global segmentation is tied between several models, and the system optimizes the likelihood using this common tied segmentation. The results show that these factors have a noticeable effect on finding the global optimum while maintaining the system accuracy. The idea was tested on isolated word recognition and phone classification tasks, where it performed well in terms of both accuracy and computational complexity.
© 2010 Elsevier Ltd. All rights reserved.
* Corresponding author.
E-mail addresses: nnajkar@gmail.com (N. Najkar), razzazi@sr.iau.ac.ir (F. Razzazi), sameti@sharif.edu (H. Sameti).
0895-7177/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.mcm.2010.03.041

1. Introduction

The hidden Markov model (HMM) is the basis of a set of successful techniques for acoustic modeling in speech recognition systems. The main reasons for this success are the model's ability to describe the speech phenomenon analytically and its accuracy in practical speech recognition systems. Another important property of the HMM is its convergent and reliable parameter training procedure. Spoken utterances are represented as a non-stationary sequence of feature vectors. Therefore, to evaluate a speech sequence statistically, the sequence must be segmented into stationary states. An HMM is a finite state machine in which each state may be modeled as a single Gaussian or a multi-modal Gaussian mixture. Due to the continuous nature of speech observations, continuous density pdfs are often used in this model. The topology of an HMM for speech is taken to be left-to-right so that it respects the temporal order of the observations. This topology allows transitions from each state to itself and to its right-hand neighbors. HMM parameters are usually estimated in the training phase by maximum likelihood [1] or discriminative [2,3] training algorithms using sufficient training data. The parameters of a continuous left-to-right HMM with $N$ states and $M$ mixtures can be written as $\lambda = \{\pi, A, B\}$. Here $\pi = \{\pi_i\}$ is the initial state distribution and $A = \{a_{ij}\}$ is the state transition probability matrix, where $a_{ij} = P[q_{t+1} = j \mid q_t = i]$ is the probability of a transition from state $i$ at time $t$ to state $j$ at time $t+1$, satisfying the following constraints:

$$a_{ij} \ge 0, \qquad \sum_{j=1}^{N} a_{ij} = 1; \qquad 1 \le i, j \le N. \tag{1}$$

$B = \{b_j(o_t)\}$ is the set of observation probability densities per state, each of which may be represented by a multi-modal Gaussian mixture model as

$$b_j(o_t) = \sum_{m=1}^{M} C_{jm}\, G(o_t, \mu_{jm}, \Sigma_{jm}) \tag{2}$$

where $C_{jm}$ is the mixture coefficient for the $m$th mixture in state $j$. $C_{jm}$ satisfies the following constraints:

$$C_{jm} \ge 0, \qquad \sum_{m=1}^{M} C_{jm} = 1; \qquad 1 \le j \le N,\ 1 \le m \le M. \tag{3}$$

$G(\cdot)$ is a Gaussian distribution with mean vector $\mu_{jm}$ and covariance matrix $\Sigma_{jm}$.

Fig. 1. The overall block diagram of an automatic speech recognition system.
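As a concrete reading of Eq. (2), the following minimal Python sketch evaluates the log of a state's mixture density. It assumes diagonal covariance matrices, which the paper does not state; the function names and argument layout are hypothetical and chosen only for illustration.

```python
import math


def log_gaussian_diag(o, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption here)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )


def log_emission(o, weights, means, variances):
    """log b_j(o_t) for one state: log of sum_m C_jm * G(o_t, mu_jm, Sigma_jm)."""
    terms = [
        math.log(c) + log_gaussian_diag(o, mu, var)
        for c, mu, var in zip(weights, means, variances)
        if c > 0.0
    ]
    peak = max(terms)  # log-sum-exp trick to avoid underflow
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Working in the log domain matches the fitness function used later in the paper, and the log-sum-exp step keeps the mixture sum stable when individual densities are very small.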
Fig. 1 shows the overall block diagram of an automatic speech recognition system in the recognition phase. The continuous input speech utterance is segmented into frames by the preprocessing module. In the next step, the feature extraction module extracts a feature vector from each frame to represent its acoustic information. Hence, a discrete sequence of feature vectors (observations), $O = (o_1 o_2 \ldots o_T)$, is obtained. In an utterance classification task with vocabulary size $v$, the unknown input speech is compared with all of the HMMs $\lambda_i$ according to some search algorithm, and finally the input speech is identified as the reference HMM with the highest score. In most HMM-based systems, the Viterbi algorithm [1] is the core of the recognition procedure. The Viterbi algorithm is an exact search method that considers all possible state sequences to find the best alignment path between the input utterance and a given HMM. The full search in an HMM can be formulated as
$$LL = P(O \mid \lambda) = \max_{q_1 q_2 \ldots q_T} P[q_1 q_2 \ldots q_T,\ o_1 o_2 \ldots o_T \mid \lambda] = \max_{q_1 q_2 \ldots q_T} \Big[\pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)\Big] \tag{4}$$
where $q_t$ is the state at time $t$. The sequence $q_1 q_2 \ldots q_T$ denotes an alignment of the observation sequence with the speech HMM, and $T$ is the length of the observation sequence. Enumerating every state sequence costs $O(N^T)$, which grows exponentially with the utterance length, so a brute-force search is impractical. The Viterbi algorithm instead extracts the alignment path dynamically by a recursive procedure:
$$LL_t(j) = \max_{1 \le i \le N} [LL_{t-1}(i)\, a_{ij}]\, b_j(o_t) \tag{5}$$
where $LL_t(j)$ is the partial cost function of the alignment path in state $j$ at time $t$ and $LL_{t-1}(i)$ is the score of the best path among the possible paths that start from the first state and end in the $i$th state at time $t-1$. Fig. 2 shows a Viterbi trellis diagram in which the horizontal axis represents the time axis of the input utterance and the vertical axis represents the possible states of the reference HMM.
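The recursion in Eq. (5) can be sketched in a few lines of Python. This is an illustrative log-domain version, not the authors' implementation (their experiments used Matlab); the function name and the layout of the inputs are assumptions, and the emission scores are taken as precomputed.

```python
import math


def viterbi_log(log_pi, log_a, log_b):
    """Return (best log-likelihood, best state path) for one HMM.

    log_pi[j]   : log initial probability of state j
    log_a[i][j] : log transition probability from state i to state j
    log_b[t][j] : log b_j(o_t), precomputed for every frame t
    """
    num_frames, num_states = len(log_b), len(log_pi)
    score = [log_pi[j] + log_b[0][j] for j in range(num_states)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for t in range(1, num_frames):
        new_score, pointers = [], []
        for j in range(num_states):
            # Eq. (5) in the log domain: max over predecessors, then emit.
            best_i = max(range(num_states), key=lambda i: score[i] + log_a[i][j])
            new_score.append(score[best_i] + log_a[best_i][j] + log_b[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    last = max(range(num_states), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

The two nested loops over states inside the loop over frames are what give the method its $N^2 T$ operation count.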
The computational complexity of this method is $O(N^2 T)$. Although this saves computational cost and memory compared with the full search, the method is practical only when the input utterance is short and the number of reference HMMs is small. For continuous speech recognition in particular, this is usually not the case. To overcome this deficiency, Viterbi beam search [4] has been presented. The main idea in beam search is to keep and extend only the paths with higher scores. This pruning may sacrifice the optimality of the algorithm.
Fig. 2. Viterbi trellis diagram.
Recently, evolutionary algorithms (EAs) have been applied to speech recognition problems. However, there is little research on using these algorithms in the recognition phase of HMM-based recognizers; most studies have focused on the training phase. EAs generate a random population of possible solutions and use a collaborative search in each generation to achieve better solutions than the previous ones. In HMM training, genetic algorithms (GAs) [5-9] and particle swarm optimization (PSO) [10-12] have been studied in recent years, where each individual solution represents an HMM and is encoded as a string of real numbers. These studies have revealed that PSO can yield better recognition performance and is more capable of finding the global optimum than GA and the well-known Baum-Welch algorithm. These algorithms have also been applied to optimizing the nonlinear time alignment of template-based speech recognition in the recognition phase [13,14]. In these works, to solve the optimal warping path search problem, each potential solution is considered as a possible warping path in the search space. It was shown that using PSO with a pruning strategy considerably reduces the recognition time while maintaining the system accuracy. In contrast, using a direct GA without pruning is not a promising approach. PSO has been used to solve many NP-complete optimization problems [15,16].
In this paper, a novel approach is proposed that applies the particle swarm optimization strategy in the recognition phase of a speech recognition system instead of the traditional Viterbi algorithm, and examines the performance of PSO in finding the globally optimal segmentation. Preliminary results of this work were reported in [17]. To explore the performance of the proposed system, experiments were conducted on isolated word recognition and on the classification of stop consonant phones. Stop consonant classification is one of the most challenging tasks in speech recognition. In addition, a new classification method based on a tied segmentation strategy is introduced. The method can be generalized to the continuous speech recognition case.
The remainder of this paper is organized as follows. The next section provides the details of the proposed PSO-based recognition procedure. Section 3 presents the experimental results, and the last section concludes the paper.
2. PSO trellis recognition approach

The particle swarm optimization algorithm was originally introduced by Eberhart and Kennedy [18] in 1995. PSO is a population-based evolutionary algorithm that is initialized with a population of random solutions (particles), which are then flown through the hyperspace. To solve an optimization problem with the PSO algorithm, the problem should be appropriately modeled and mapped to PSO notation before the optimization starts. Each particle can be represented as a multidimensional vector $X_i = (x_{i1} x_{i2} \ldots x_{in})$, where the vector elements and the fitness function of each particle are determined by the problem domain. In each iteration, each particle keeps track of the coordinates in the hyperspace that are associated with the best solution it has achieved so far, called pbest: $\mathrm{pbest}_i = (p_{i1} p_{i2} \ldots p_{in})$. The swarm also keeps the overall best location obtained thus far by any particle in the population, called gbest: $\mathrm{gbest} = (g_1 g_2 \ldots g_n)$. The position of each particle therefore changes toward pbest and gbest according to the particle velocity, which is obtained in the next iteration as follows:
$$V_i^{k+1} = \omega V_i^k + \alpha r_1 (\mathrm{pbest}_i^k - X_i^k) + \beta r_2 (\mathrm{gbest}^k - X_i^k) \tag{6}$$

$$X_i^{k+1} = X_i^k + V_i^{k+1} \tag{7}$$
where $\omega$ is the inertia weight, and $\alpha$ and $\beta$ are constants which guide the particles towards the improved positions. $r_1$ and $r_2$ are uniformly distributed random variables between 0 and 1, and $k$ denotes the evolution iteration. The PSO algorithm is terminated after a maximal number of iterations or when the best particle position of the entire swarm cannot be improved further after sufficient iterations.

Fig. 3. The concept of modification of searching points in the solution space.

Fig. 4. An example of a four-state trellis diagram with representation segmentation vector and state sequence vector.

Fig. 3 shows the concept of modification of searching points through iterations.
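The standard update in Eqs. (6) and (7) can be sketched as follows. This is a generic illustration of one particle's step, assuming list-based positions and velocities; the function name and the `rng` parameter are hypothetical, and the coefficient values are left to the caller because the paper does not give values for the standard algorithm.

```python
import random


def pso_step(x, v, pbest, gbest, omega, alpha, beta, rng=random):
    """One standard PSO update for a single particle (Eqs. (6) and (7))."""
    r1, r2 = rng.random(), rng.random()  # uniform in [0, 1)
    new_v = [
        omega * vi + alpha * r1 * (pb - xi) + beta * r2 * (gb - xi)
        for xi, vi, pb, gb in zip(x, v, pbest, gbest)
    ]
    new_x = [xi + nv for xi, nv in zip(x, new_v)]
    return new_x, new_v
```

Note that the resulting positions are real-valued and unconstrained, which is why this rule needs adapting before it can move discrete, left-to-right alignment paths.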
2.1. Defining particles

To evaluate the performance of the PSO-based recognizer, two methods were applied to define particles. In the first method (SS), each particle is one of the allowable state sequences in the trellis of the problem solution space. The trellis diagram, shown in Fig. 4, is a state transition diagram that represents all of the possible states over a sequence of time intervals.
An allowable path is specified by two properties. First, the path should be left-to-right in the sequence of states (i.e., the state index never decreases as time increases). Second, each path starts from the first state of the HMM and may (or may not) lead to the final state.
However, as we will show in the next section, this particle definition leads to a local optimum solution, even though the same technique was used in [14] and provided good performance. This indicates the importance of choosing a particle representation suited to the problem.
The main idea of the second method (Segment) is to correct and improve the segmentation of the utterance. This definition of particles gives better results in finding the global optimum solution. In this procedure, each particle is a segmentation vector, that is, its components are the transition locations from one segment to the next. Hence, the particle vectors are shorter, with a length equal to the number of segments (states) of the corresponding HMM. Because the search space is reduced, the exploration can be performed more effectively. Repeated elements in the segmentation vector represent jumps (skipped states) in the state sequence. The repeated element is kept so that every path vector has the same length, which is essential for defining a proper movement of particles.
Table 1
Movement structure.

| Conditions | Positions |
|---|---|
| $x < \alpha$ | Move in the direction of pbest |
| $\alpha \le x < \alpha + \beta$ | Move in the direction of gbest |
| $\alpha + \beta \le x < \alpha + \beta + \gamma$ | Move in a random direction |
| Else | Stay at the current position |
In the first type of particle definition, the length of the path vector $X_i = (q_{i1} q_{i2} \ldots q_{it} \ldots q_{iT})$ is equal to the duration of the utterance. In the second, the length of each segmentation vector $X_i = (s_{i1} s_{i2} \ldots s_{in} \ldots s_{iN})$ is equal to the number of reference HMM states. Fig. 4 shows both types of particle.
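The relation between the two particle types can be illustrated with a small conversion sketch. The exact encoding of the segmentation vector is not spelled out in the text, so the convention below is an assumption: entry `seg[n]` is the (exclusive) frame index at which state `n` ends, states are numbered from zero, and a repeated entry gives a state zero frames, i.e. a skipped state. Both function names are hypothetical.

```python
def segmentation_to_states(seg, num_frames):
    """Expand a segmentation vector into a per-frame state sequence.

    Assumed encoding: seg is non-decreasing, seg[n] is the exclusive end
    frame of state n, and the last entry equals num_frames.
    """
    if not seg or seg[0] < 1 or seg[-1] != num_frames:
        raise ValueError("path must start in state 0 and cover all frames")
    states, start = [], 0
    for state, end in enumerate(seg):
        if end < start:
            raise ValueError("boundaries must be non-decreasing")
        states.extend([state] * (end - start))
        start = end
    return states


def states_to_segmentation(states, num_states):
    """Inverse mapping for a left-to-right (non-decreasing) state sequence."""
    return [sum(1 for q in states if q <= n) for n in range(num_states)]
```

For a seven-frame utterance and a four-state model, the vector `[2, 5, 5, 7]` expands to `[0, 0, 1, 1, 1, 3, 3]`: the repeated 5 means the third state is skipped, while the vector keeps a fixed length of four.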
2.2. Fitness function

To evaluate each particle in both of the above methods, the same fitness function is used. However, particles that are represented as segmentation vectors must first be converted to their corresponding state sequences to calculate the fitness. We can rewrite Eq. (4) in the logarithmic domain over one of the possible paths as a fitness function:

$$LL(X_i) = \log(b_1(o_1)) + \sum_{t=2}^{T} \{\log(a_{q_{i(t-1)} q_{it}}) + \log(b_{q_{it}}(o_t))\} \tag{8}$$

where $LL(X_i)$ is the log-likelihood of the $i$th particle, $a_{q_{i(t-1)} q_{it}}$ is the transition probability from state $q_{i(t-1)}$ to state $q_{it}$ and $b_{q_{it}}(o_t)$ is the probability density of producing the observation $o_t$ in state $q_{it}$. Since all paths start from the first state in the search space, the first element of $\pi$ is one (zero in the logarithmic domain). Therefore, the initial-state term of Eq. (4) has been omitted in Eq. (8).
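A direct transcription of Eq. (8) is short. The sketch below assumes the particle has already been expanded to a state sequence and that the log transition and log emission scores are precomputed; the function name and table layout are hypothetical.

```python
def path_log_likelihood(states, log_a, log_b):
    """Fitness of one state sequence, following Eq. (8).

    states      : state index per frame (the expanded particle)
    log_a[i][j] : log transition probability from state i to state j
    log_b[t][j] : log b_j(o_t) for frame t
    """
    # The initial-state term is omitted because every path starts in the
    # first state, whose log prior is zero.
    total = log_b[0][states[0]]
    for t in range(1, len(states)):
        total += log_a[states[t - 1]][states[t]] + log_b[t][states[t]]
    return total
```

Each evaluation is a single pass over the frames, so scoring one particle costs a number of operations proportional to the utterance length.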
2.3. Movement structures

To update the position of a particle at each iteration, two methods are applied. In the first method (LCPSO), the particle movement is modified as follows:

$$V_i^{k+1} = \alpha (\mathrm{pbest}_i^k - X_i^k) + \beta (\mathrm{gbest}^k - X_i^k) + \gamma (\mathrm{randX}_i^k - X_i^k) \tag{9}$$

$$X_i^{k+1} = X_i^k + V_i^{k+1}. \tag{10}$$

This method is similar to the standard PSO algorithm described above; however, we added a random term to make the algorithm robust against local optima and omitted the inertia weight that scales the velocity of the previous time step. We can rewrite Eqs. (9) and (10) as

$$X_i^{k+1} = (1 - (\alpha + \beta + \gamma)) X_i^k + \alpha\, \mathrm{pbest}_i^k + \beta\, \mathrm{gbest}^k + \gamma\, \mathrm{randX}_i^k. \tag{11}$$

The coefficients $\alpha$, $\beta$ and $\gamma$, which scale the influence of pbest, gbest and the random term, should satisfy the following constraints:

$$\alpha + \beta + \gamma < 1; \qquad \alpha, \beta, \gamma > 0. \tag{12}$$

These conditions guarantee that particles remain consistent across generations, i.e., that each particle still represents a left-to-right continuous warping path; when the velocity update rule of the standard PSO (SPSO) is used for this problem, it is possible to generate a discontinuous path. This method leads to a local optimum, and the second method (ProbPSO) is a solution that avoids this problem. We previously used this updating method in [14]; it is explained in Table 1, where $\alpha$, $\beta$ and $\gamma$ are three constant parameters that have been introduced to move a particle. The movement is selected according to the value of a uniformly distributed random parameter $x$ between zero and one.
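The two movement structures can be contrasted in a short sketch. The LCPSO function follows Eq. (11) directly. For ProbPSO, the text specifies only how the direction is chosen (Table 1), not how far a particle moves, so the one-frame step in `step_toward` is purely an illustrative assumption, as are all function names and the `rng` parameter.

```python
import random


def lcpso_update(x, pbest, gbest, rand_x, alpha, beta, gamma):
    """Eq. (11): weighted combination of current position, pbest, gbest
    and a random particle."""
    if not (alpha > 0 and beta > 0 and gamma > 0 and alpha + beta + gamma < 1):
        raise ValueError("coefficients must satisfy Eq. (12)")
    keep = 1.0 - (alpha + beta + gamma)
    return [
        keep * xi + alpha * pb + beta * gb + gamma * rx
        for xi, pb, gb, rx in zip(x, pbest, gbest, rand_x)
    ]


def step_toward(x, target):
    """Move every boundary one frame toward the target (assumed step size)."""
    return [xi + (ti > xi) - (ti < xi) for xi, ti in zip(x, target)]


def probpso_update(x, pbest, gbest, rand_x, alpha, beta, gamma, rng=random):
    """Table 1: choose one movement from a single uniform draw in [0, 1)."""
    u = rng.random()
    if u < alpha:
        return step_toward(x, pbest)
    if u < alpha + beta:
        return step_toward(x, gbest)
    if u < alpha + beta + gamma:
        return step_toward(x, rand_x)
    return list(x)  # stay at the current position
```

With integer, non-decreasing segmentation vectors, a one-frame step toward another non-decreasing vector keeps the result non-decreasing, so the moved particle is still a valid left-to-right segmentation. ProbPSO also performs one comparison chain per particle instead of a full weighted sum, which is in line with the lower cost the paper attributes to it.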
2.4. Segmentation strategies

The tied segmentation method is a new method for classification. In the private segmentation method (Fig. 1), the PSO-based algorithm is run separately for each class to assess the likelihood of the input speech given that class. In each class, particles modify their positions using the gbest obtained for the corresponding class in every iteration. In the tied segmentation method, by contrast, a global segmentation is obtained by comparing the particle scores of all classes, and the particles are updated in the direction of this global segmentation. This comparison causes the score of the true class under the global segmentation to increase over the generations and the scores of the other classes under the same segmentation to decrease. The Viterbi algorithm is not able to offer this flexibility.
Table 2
Implementation parameters of isolated word recognizer.

| Test words | #States | #Mixtures | Frame shift (ms) | Frame length (ms) |
|---|---|---|---|---|
| /water/, /like/, /year/, /wash/, /greasy/, /dark/, /carry/, /oily/ | 4 | 9 | 12 | 20 |

Table 3
Implementation parameters of phone classifier.

| Test stop phones | #States | #Mixtures | Frame shift (ms) | Frame length (ms) |
|---|---|---|---|---|
| /b/, /d/, /g/, /p/, /t/, /k/ | 3 | 16 | 10 | 16 |

2.5. Computational complexity

The computational complexity of Segment-LCPSO is similar to that of the standard PSO algorithm. The Segment-LCPSO algorithm becomes trapped in a local optimum; furthermore, it is a very slow process for this recognition task. In the proposed Segment-ProbPSO algorithm, however, the probability-based movement structure makes the computational complexity lower than that of the standard PSO algorithm. To save computational cost in generating the initial particles, we produce a pre-saved look-up table of possible particles. In addition, we re-use the look-up table to move the particles in a random direction. In summary, most of the calculations are spent on the fitness function computation (Eq. (8)). Since each particle is evaluated at every iteration and the summations are proportional to the length of the input speech, the computational order of the proposed Segment-ProbPSO is $O(M \cdot I \cdot T)$, where $M$ is the population size, $I$ is the required number of generations and $T$ is the length of the observation sequence. Compared with the $O(N^2 T)$ cost of the Viterbi algorithm, it is cheaper if the following constraint is satisfied: $M \cdot I < N^2$.
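The constraint $M \cdot I < N^2$ can be checked with a few lines of arithmetic. The sketch below compares only the leading-order operation counts and ignores all constant factors, so it is an order-of-magnitude guide rather than a timing model; the function names are hypothetical.

```python
def viterbi_ops(num_states, num_frames):
    """Leading-order cost of the Viterbi recursion, N^2 * T."""
    return num_states ** 2 * num_frames


def pso_ops(population, iterations, num_frames):
    """Leading-order cost of Segment-ProbPSO, M * I * T."""
    return population * iterations * num_frames


def pso_is_cheaper(population, iterations, num_states):
    """The frame count T cancels, leaving the constraint M * I < N^2."""
    return population * iterations < num_states ** 2
```

For the three-state phone models, six particles and one iteration give $6 < 9$, so the constraint holds. For the four-state word models, eight particles and 20 iterations give $160 > 16$, so by this count the constraint is not met in that setting.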
3. Experimental results

3.1. Experimental setup

To evaluate the performance of the idea in an utterance classifier, a set of experiments was conducted on the eight most frequently occurring words of the standard TIMIT speech database, which are listed in Table 2. Each word occurs more than 460 times in the training set and more than 160 times in the test set. Although TIMIT is a continuous speech recognition benchmark, the variety of words and speakers in TIMIT makes it a good benchmark for our task.
In addition, the idea was evaluated on a stop consonant phone classifier for the six TIMIT stop phones listed in Table 3. We eliminated phones shorter than three frames from the test set. The resulting test set for the six stop phones contained 5184 phones in total. The overall block diagram of both HMM-based recognition systems is similar to Fig. 1. The Baum-Welch algorithm was applied to train a continuous density HMM for each word and phone. The number of states was set to four for each word and three for each phone. There are 9 mixtures per state in the word models and 16 mixtures per state in the phone models.
In the preprocessing stage of both systems, the audio signal is transformed into 26-dimensional MFCC feature vectors. In the word recognizer, feature vectors are extracted from 20 ms windows of the utterance, with consecutive windows overlapping by 8 ms (a 12 ms frame shift, Table 2). In contrast, in our phone classification test bed, the preprocessor produced a feature vector every 10 ms from 16 ms windows. The first 12 features are based on 25 mel-scaled filter bank coefficients, the 13th element is a log energy coefficient and the 13 remaining features are their first derivatives. The tests were simulated in Matlab 7.6. A summary of the implementation parameters is given in Tables 2 and 3, and Fig. 5 shows the block diagram of our system.
3.2. Isolated word recognizer results

Figs. 6 and 7 describe the overall behavior of the suggested system. Fig. 6 shows the effect of the particle definition on the convergence ratio with respect to the Viterbi path likelihood. When the particles are represented as state sequence vectors, the search starts from lower ratios. In addition, the recognition system is easily trapped in a local optimum.
The experiments show that if particles are defined as segmentation vectors and the movement update follows the second method in Section 2.3, the probability of finding the global optimum and approaching the Viterbi path likelihood increases. Therefore, Segment-ProbPSO is used as the baseline method in all of the experiments, and the optimizations have also been performed using this method.
Fig. 7 shows the effect of the movement structure. Although both curves start from a common point, their convergence rates are different. If the particle movement over the generations is defined as a probabilistic structure, it is more probable that the global optimum is found.
The results in Table 4 reveal that the Viterbi and Segment-ProbPSO algorithms have comparable error rates: they are identical for two and eight reference words, and Segment-ProbPSO is lower for four and six reference words. Thus, although we used the Viterbi recognition procedure as the benchmark for our system, the proposed method provides better results in some cases. This points to a major drawback of the traditional recognition process, which makes its decision by comparing the best paths between the unknown uttered word and the given word models.
Fig. 8 shows an example of a comparison between the Viterbi and Segment-ProbPSO recognition processes over 10 iterations. The unknown input utterance belongs to word model 1. This test sample is recognized correctly by the Viterbi algorithm, while the proposed algorithm recognizes it as the second word model after 10 iterations. However, after more generations, the system achieves the correct result; more iterations were required to obtain sufficient convergence. In most cases, the difference between the likelihoods of the competing models is large and the desired accuracy is reached in the early iterations.

Fig. 5. Speech recognizer block diagram.

Fig. 6. Particles defining influence in convergence percentage to Viterbi path likelihood. [Plot: (LL.Viterbi/LL.PSO)*100 versus iteration; curves: Segment. ProbPSO and SS. ProbPSO.]

Fig. 7. Movement construction influence in convergence percentage to Viterbi path likelihood. [Plot: (LL.Viterbi/LL.PSO)*100 versus iteration; curves: Segment. ProbPSO and SS. LCPSO.]

Table 4
Comparison error rates.

| #Reference words | Viterbi error rate (%) | Segment-ProbPSO error rate (%) |
|---|---|---|
| 2 | 0.86 | 0.86 |
| 4 | 0.58 | 0.44 |
| 6 | 0.87 | 0.66 |
| 8 | 0.73 | 0.73 |
Figs. 9 and 10 report the system optimization results. Fig. 9 shows the influence of the $\alpha$, $\beta$ and $\gamma$ coefficients on the recognition error rate under fixed conditions for eight reference word models in 20 iterations. The optimum value of $\alpha$ is 5 for $\beta = 15$ and $\gamma = 15$. The optimum value of $\beta$ is 5 for $\alpha = 5$ and $\gamma = 15$, and the optimum value of $\gamma$ is 10 for $\alpha = 5$ and $\beta = 5$.
Fig. 10 shows the effect of population size on the overall error rate. The computational cost increases with the population size. Therefore, the optimum population size is the point where the curve saturates, which is about eight particles in the empirical curve depicted in Fig. 10.
3.3. Phone recognition results

Considering the good performance of Segment-ProbPSO on the isolated word recognition system, some phone classification experiments were conducted using Segment-ProbPSO with the optimum values of $\alpha$, $\beta$ and $\gamma$ obtained in the previous section. The results are depicted in Fig. 11 and Table 5, and are compatible with the stop recognition results in the literature [19].
Fig. 11 shows the strong performance of this recognizer, which is even better than the PSO-based isolated word recognizer at reaching the global optimum. Furthermore, already in the initial iteration, the gbest likelihood values of the phone classes come close to the value of the Viterbi best path. In addition, the system error rate in the first iteration with six particles, as shown in Table 5, is 32.45%, which is about two percentage points above the 30.29% error rate of the baseline system based on the Viterbi recognition procedure. Therefore, with a few additional iterations, the system can easily reach the desired accuracy.
Table 5 shows the variation of the system error rate with population size and number of iterations. If we neglect a difference of 1% or 2% in error rate, we can claim that, in the phone classification task, the computational cost of the proposed algorithm is almost equivalent to, or even less than, that of the Viterbi algorithm.
3.4. Tied segmentation method results

The results of the tied segmentation method in Table 6 show that both classifier types have almost the same performance. However, with this method, convergence to the desired accuracy is faster than with the previously proposed methods.
Fig. 8. An example of comparison of likelihood values versus number of iterations for Viterbi and Segment-ProbPSO methods for four reference word models. [Four panels, reference 1 to reference 4: -Likelihood versus iteration (0 to 12); curves: LL.Segment.ProbPSO and LL.Viterbi.]

Fig. 9. The influence of $\alpha$, $\beta$ and $\gamma$ on the recognition error rate for eight reference word models in 20 iterations. [Plot: error rate (%) versus the value of $\alpha$, $\beta$, $\gamma$ (5 to 50); curves: $\alpha$ influence, $\beta$ influence, $\gamma$ influence.]

Fig. 10. The effect of population size on recognition error rate for 20 iterations and $\alpha = \beta = 5$ and $\gamma = 10$. [Plot: error rate (%) versus initial population (0 to 18).]

Fig. 11. Convergence percentage of Segment-ProbPSO's gbest likelihood values to the Viterbi path likelihood. [Plot: (LL.Viterbi/LL.PSO)*100 versus iteration (0 to 40).]
Table 5
Phone classifier error rates (%).

| #Particles \ #Iterations | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 39.12 | 36.94 | 35.63 | 34.88 | 34.22 | 33.12 |
| 2 | 36.65 | 34.86 | 33.72 | 33.04 | 32.64 | 32.31 |
| 3 | 35.38 | 33.35 | 32.48 | 31.85 | 31.58 | 31.46 |
| 4 | 34.26 | 32.54 | 31.98 | 31.69 | 31.27 | 31.15 |
| 5 | 33.04 | 32.37 | 31.69 | 31.19 | 30.92 | 30.90 |
| 6 | 32.45 | 31.79 | 31.53 | 30.74 | 30.63 | 30.55 |

Table 6
Phone classifier error rates (%).

| #Particles \ #Iterations | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 39.12 | 37.58 | 36.34 | 34.68 | 33.97 | 33.02 |
| 2 | 36.65 | 34.95 | 33.31 | 32.91 | 32.43 | 31.91 |
| 3 | 35.38 | 33.24 | 32.48 | 31.73 | 31.60 | 31.20 |
| 4 | 34.26 | 32.20 | 31.46 | 31.15 | 31.02 | 30.94 |
| 5 | 33.04 | 31.83 | 31.15 | 30.90 | 30.64 | 30.22 |
| 6 | 32.45 | 31.13 | 30.83 | 30.63 | 30.55 | 30.13 |
4. Conclusion and future work

Although there are several methods for speech recognition, it is still an open problem owing to the lack of a fast and accurate algorithm. In this paper, a new recognition approach based on particle swarm optimization was introduced. The main idea of this approach is to generate an initial population of segmentation vectors in the solution search space and then to correct and improve the location of the segments. The algorithm was tested on both isolated word recognition and phone classification tasks. The experimental results show that this idea moves properly toward the global optimum while maintaining the accuracy of the Viterbi system.
Considering the computational complexity of the PSO-based recognition procedure and its ability to prune before reaching the best path, it seems that this method could be well employed in continuous speech recognition tasks. Therefore, we are pursuing our research on continuous speech recognition.
References

[1] L.R. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE 77 (2) (1989) 257-286.
[2] S. Mizuta, K. Nakajima, A discriminative training method for continuous mixture density HMMs and its implementation to recognize noisy speech, Journal of the Acoustical Society of Japan 13 (6) (1992) 389-393.
[3] Q.Y. Hong, S. Kwong, A training method for hidden Markov model with maximum model distance and genetic algorithm, in: Proceedings of IEEE International Conference on Neural Networks and Signal Processing, 2003, pp. 465-468.
[4] H. Ney, Dynamic programming parsing for context free grammars in continuous speech recognition, IEEE Transaction on Signal Processing 39 (2) (1991) 336-340.
[5] S. Kwong, C.W. Chau, K.F. Man, K.S. Tang, Optimisation of HMM topology and its model parameters by genetic algorithms, Pattern Recognition 34 (2) (2001) 509-522.
[6] C.W. Chau, S. Kwong, C.K. Diu, W.R. Fahrner, Optimization of HMM by a genetic algorithm, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol. 3, 1997, pp. 1727-1730.
[7] F. Yang, C. Zhang, G. Bai, A novel genetic algorithm based on tabu search for HMM optimization, in: Proceedings of the 4th International Conference on Natural Computation, vol. 4, 2008, pp. 57-61.
[8] S. Kwong, Q.H. He, K.W. Ku, T.M. Chan, K.F. Man, K.S. Tang, A genetic classification error method for speech recognition, Signal Processing 82 (5) (2002) 737-748.
[9] P. Bhuriyakorn, P. Punyabukkana, A. Suchato, A genetic algorithm-aided Hidden Markov Model topology estimation for phoneme recognition of Thai continuous speech, in: Proceedings of the 9th International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2008, pp. 475-480.
[10] L. Xue, J. Yin, Z. Ji, L. Jiang, A particle swarm optimization for Hidden Markov Model training, in: Proceedings of the 8th International Conference on Signal Processing, vol. 1, 2006, pp. 791-794.
[11] H. Sajedi, H. Sameti, H. Beigy, B. Babaali, Discriminative training of Hidden Markov Model using PSO algorithm, in: Proceedings of 12th Annual International CSI Computer Conference, 2007, pp. 295-302.
[12] F. Yang, C. Zhang, T. Sun, Comparison of particle swarm optimization and genetic algorithm for HMM training, in: Proceedings of the International Conference on Pattern Recognition, 2008, pp. 1-4.
[13] S. Kwong, C.W. Chau, W.A. Halang, Genetic algorithm for optimizing the nonlinear time alignment of automatic speech recognition systems, IEEE Transaction on Industrial Electronics 43 (5) (1996) 559-566.
[14] S. Rategh, F. Razzazi, A.M. Rahmani, S.O. Gharan, A time warping speech recognition system based on particle swarm optimization, in: Proceedings of the International Conference on Modeling and Simulation, 2008, pp. 585-590.
[15] Y. Shi, R.C. Eberhart, A modified particle swarm optimizer, in: Proceedings of the IEEE International Conference on Evolutionary Computation, IEEE Press, Piscataway, NJ, 1998, pp. 69-73.
[16] J. Yan, H. Tieson, H. Chongchao, W. Xianing, A modified particle swarm optimization algorithm, in: Proceedings of the IEEE International Conference on Computational Intelligence and Security, 2006, pp. 421-424.
[17] N. Najkar, F. Razzazi, H. Sameti, A novel approach to HMM-based speech recognition system using particle swarm optimization, in: Proceedings of IEEE International Conference on Bio-Inspired Computing: Theories and Application, 2009, pp. 1-6.
[18] J. Kennedy, R.C. Eberhart, Particle swarm optimization, in: Proceedings of IEEE International Conference on Neural Networks, IEEE, Piscataway, NJ, 1995, pp. 1942-1948.
[19] P.K. Ghosh, S.S. Narayanan, Closure duration analysis of incomplete stop consonants due to stop-stop interaction, Journal of the Acoustical Society of America 126 (1) (2009) EL1-7.