Non-linear Normalization of the Input in Relation to the Speaker Variability in Neural Networks for Speech Recognition

Seid Ali Salehi
Amirkabir University of Technology
The issue of input variability resulting from speaker changes is one of the most crucial factors influencing the effectiveness of speech recognition systems. A solution to this problem is adaptation of the input, in a way that all the input representation parameters are adapted to a single speaker, so that a kind of normalization is performed on the input pattern in relation to the speaker changes before recognition is effected.
Three methods are recommended for this purpose in the present paper, in which a portion of the influence of speaker changes on speech recognition by neural networks is compensated. In all three methods, a network is first trained for mapping of the input into the codes representing the phone classes and the speakers. Then, among the 71 speakers under training, the one presenting the highest percentage of phone recognition accuracy is selected as the reference speaker, in order to convert the parameter representations of the other speakers to the analogous speech of this reference speaker.
In the first method, the error backpropagation algorithm is used for calculating the optimal point of the decision region relating to each phone from each speaker, for all the phones and all the speakers. The distances of these points from the analogous points related to the reference speaker are employed for offsetting the speaker changes and adapting the input signal to the reference speaker.
In the second method, using the error backpropagation algorithm and maintaining the reference speaker data as the desired speaker output, we change all the speech signal frames relating to the training and the test, so that they coincide with the analogous speech of the reference speaker.
In the third method, an inverted network is also trained, at first, for mapping of the phone data and the speaker code back to the input representation. Then the phone data retrieved from the direct network is given to the inverted network together with the reference speaker code. Using these data, the inverted network produces an estimation of the input representation adapted to the reference speaker. In all three methods, the ultimate speech recognition model is trained with the adapted training data, and is tested with the adapted testing data.
Implementing these three methods and integrating the final network results with the primary network results based on the highest confidence level, increases of 2.1%, 2.6% and 3% in phone recognition accuracy on clean speech are gained, respectively.

Keywords: Speaker variability, Input adaptation, Non-linear normalization, Inverted neural networks.
In all studies somehow related to pattern recognition, the issue of variability exists. Variability is attributed to factors influencing the patterns and making their recognition difficult. The following can be mentioned as examples of these variabilities in speech recognition: change of the speaker, change of the accent and tone of the voice, change of the voice pitch, changes of the distance and direction of the speaker, changes of environmental effects such as walls, changes of the direct transmission channel, changes of the recording system, changes in speaking rate, etc.
Together with the developments achieved in speech recognition research, the importance of the influence of speaker changes has become increasingly clear. The efficiency of Speaker Dependent (SD) systems is higher than that of Speaker Independent (SI) systems, and speaker variability analysis tries to minimize this gap.
In the present study we aim to compensate the variability resulting from speaker changes in automatic recognition of speech signals. Some recent studies show that the robust performance of the human perception system in speech recognition, image recognition, etc. is related to the special mode of signal processing performed in our brain. The characteristics of this special processing mode attributed to the brain cortex are mentioned as: 1) Processing in neural networks, with feedforward and inverted or reversal connections. 2) The ability of non-linear processing of signals, and analysis of the linear and non-linear components existing in the input signal. 3) The ability of adapting to special conditions.
The method employed in the present paper for compensating the input variability is regulating the input in a way that all the signals are adapted to a single speaker; hence a kind of non-linear normalization is performed in relation to the speaker. Consequently, the signals are equalized in respect of the data content, and the speaker variability effect is eliminated.
To simulate the above-said reverse connections, we propose and implement two propositions. In the first proposition, the error backpropagation method is used for input adaptation: at first a network is trained, and then, by keeping the weights fixed and backpropagating the present output error in relation to the desired output and correcting the input, the input is directed toward adaptation with the desired output. In the second method, we suggest training an inverted network beside the direct one, and employing this inverted network for approximation of the direct network input, maintaining our desired output.
This paper includes the following parts: In the second part, the issue of the speaker variability is described in more detail. In the third part, the compensation methods of speaker variability employed so far are reviewed. The basis of the input adaptation idea using the error backpropagation method is introduced in the fourth part. The fifth part describes the structure of the models employed in the present study. The sixth part describes the algorithm of the proposed methods. The experiments and the results analysis are presented in the seventh part. The final part of this paper is allocated to discussion and conclusion of the study.
2. The speaker variability issue
When we train a model with speech signals related to a specific speaker, this model forms the decision regions related to this speaker, and with the aid of these regions it decides about the signal of a new speaker. Admittedly, the accuracy of phone recognition on the new speaker's data is substantially lower than on the training data. The reason for this is the changes of the new signal in relation to the training signal. In fact, the relationship between the characteristic vector x relating to the training signal and the characteristic vector y relating to the test signal can be written as follows:

    y = F(x)    (1)
where the function F describes the effects of the variabilities such as speaker changes, microphone, voice pitch, accent, etc. A part of these variations is related to the differences among the speech characteristics of the speakers, and we are trying to compensate this part in our work. Since F is a complicated, non-linear, time-varying function, it must be calculated momentarily and numerically. For example, to compensate the speaker variability, F would have to be calculated separately for every frame of every speaker. So it would be necessary to use a flexible and adaptive model for this purpose.
Neural networks can be a good choice for this. The most important advantages of neural networks are: 1) Implementing neural networks usually requires few assumptions, and they can exploit huge databases in a perfect manner. 2) The error backpropagation method can be generalized to almost any optimization criterion. 3) Neural networks can easily be accommodated in bigger structures as modules.
3. A review of previous works
In the variety of methods used for compensating the effect of speaker changes, the function F shown in relationship (1), or a function of it, is calculated. Then, either this variability is removed from the desired patterns by omitting it, so that the reference pattern is achieved, or the recognition is carried out with the help of this function, implementing it beside the desired pattern.
Three methods have been used so far to compensate a part of this gap:
1) Adaptation of the characteristics (normalization), that is, converting the training and test characteristics so that the speaker differences are omitted from them and they nearly look alike. This conversion must be implemented on the training patterns before training the model. Admittedly, a model trained with these normalized data shall be near to an SD system, and the said method will offset part of the speaker variability. The normalization process must also be carried out on all of the test patterns when testing the model. In case the conversion is not effected on any of the test patterns, the speech under estimation would be unnatural for the normalized system. However, because F is a non-linear, time-varying function, its calculation is quite difficult; it is not fully omitted in the normalization process, and only some of the less significant effects of the speaker changes can be compensated. This is because most of the normalization methods employed in previous papers and works are linear. These normalizations are usually based on statistical models and digital signal processing, which are normally accompanied by many approximations, and the capabilities of neural networks have been neglected.
2) Adaptation of the speaker, that is, regulation of the speaker-independent model so that it fits the new voice. The model adaptation algorithms differ with regard to the types of parameters regulated, the criterion employed, and the limitations implemented on the conversion. The linear conversion family (e.g. MLLR), the Bayesian training family (e.g. MAP), and the speaker space family (e.g. eigenvoices) are methods used to solve this problem. In some other works, the attempt has been to use clustering for compensating the speaker variability. In other methods, numerous models such as gender-dependent, accent-dependent, etc. are made, and then a model selection method is employed for adaptation. But the complexity of speech recognition models and the need for huge data assemblies make the model adaptation methods very complicated. Too many free parameters in these models make their analysis difficult. Model adaptation methods have also been implemented using neural networks.
3) Another approach to encountering the speaker change problem is that, instead of normalization and discarding of the speaker data, or adaptation, the speaker data are used for recognition.
4. Employing the error backpropagation algorithm for input adaptation
In two of the three methods proposed in this paper, the error backpropagation algorithm has been used for adaptation of the input. So, at first, it is necessary to explain this algorithm.
Assume that the network training inputs and the output labels analogous to them are at hand. In that case, the goal of the network training is to find the network weights, so that they are capable of mapping the input data into the analogous outputs (labels) in a perfect manner. In the error backpropagation method a cost function is assumed for network training; normally a suitable option for this function is the sum of squares of the differences between the desired outputs and the actual outputs of the network. So, the cost function is written as follows:

    E = (1/2) * sum_{i=1}^{N} (d_i - o_i)^2    (2)

where d_i is the desired output of the network, o_i its approximation by the network, and N is the number of network outputs. The coefficient 1/2 has only been included for simplification of the calculations.
Assuming the network has only one hidden layer, and letting n and m represent the number of the input and hidden neurons, the value of the output layer will be equal to:

    o = f(W2 * f(W1 * x))    (3)

where W1 and W2 are the first and second layer weight matrices respectively, and f is the non-linear activation function of the network's neurons.
The error backpropagation algorithm is a conventional method for training feedforward networks. Consider that a network has been trained using the above method for mapping of the input to the output labels, and the output error is near its minimum. Now assume that we want to change the inputs of this network, so that the inputs are mapped into new labels. In that case, the inputs must move in a direction such that the difference between the present output of the network and the new label is minimized. For this purpose, the error backpropagation algorithm can be used.
Here again the error function representation is the same as relationship (2). However, it must be noticed that this error function must be defined on that part of the output that has got new labels. At this stage, the direction and the value of the input data movement in the n-dimensional input space must be calculated so as to minimize the new error.
To determine this value, the basis of the error backpropagation algorithm is used, except that here the network's trained weights are constant and only the input changes. In this case the error function is the same as relationship (2), but the label may be different from the original one. The calculations for acquiring the error backpropagation relationships give the input correction:

    delta_x = -eta * dE/dx    (4)

Adding this value to the input and recalculating the output, the value of the error function decreases, and this will continue until minimization of this function. The delta_x value in this method is calculated step by step, and the modification of the input is carried out dynamically; finally, it gives us an approximate convergence of the input toward one analogous to the new output. Thereby we can convert the present input into the analogous input of the reference speaker.
In the sixth part of this paper, the algorithm of the two proposed methods which use error backpropagation for regulating the input is described.
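The input adaptation idea of relationships (2)-(4) can be sketched as follows. This is a minimal illustration only: the tiny network, its random "trained" weights, and the learning rate are hypothetical placeholders, not the paper's actual model; the point is that the gradient of the squared error is taken with respect to the input while the weights stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical tiny network; the weights stand in for an already-trained model.
W1 = rng.normal(scale=0.5, size=(8, 4))   # hidden x input
W2 = rng.normal(scale=0.5, size=(3, 8))   # output x hidden

def forward(x):
    h = sigmoid(W1 @ x)
    o = sigmoid(W2 @ h)
    return h, o

def adapt_input(x, d, eta=0.5, steps=200):
    """Move the input x toward the new label d, keeping the weights constant."""
    x = x.copy()
    for _ in range(steps):
        h, o = forward(x)
        # delta at the output for E = 1/2 ||d - o||^2 with sigmoid units
        delta_o = (o - d) * o * (1 - o)
        delta_h = (W2.T @ delta_o) * h * (1 - h)
        # gradient is taken with respect to the input instead of the weights
        grad_x = W1.T @ delta_h
        x -= eta * grad_x                 # relationship (4): delta_x = -eta dE/dx
    return x

x0 = rng.normal(size=4)
d = np.array([1.0, 0.0, 0.0])            # new desired label
e0 = 0.5 * np.sum((d - forward(x0)[1]) ** 2)
x1 = adapt_input(x0, d)
e1 = 0.5 * np.sum((d - forward(x1)[1]) ** 2)
print(e1 < e0)
```

Each step recalculates the output and moves the input a little further, matching the "dynamic, step-by-step" modification described above.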
5. The structure of the employed models
The structure of the reference network

In the present work, first a neural network is trained as the reference network for estimating the efficiency of the presented methods, and then the results gained will be compared with it.
The reference network employed here is a time delay neural network (TDNN), the structure of which is shown in Fig. 1. The characteristics of 14 consecutive frames are all at once presented to the network input. This network has two hidden layers: the first layer has no time delay, and the second one has time delay. The output layer has 35 neurons, each representing one of the Farsi speech phones.
For every phone we have a one-of-n code as the desirable output: one neuron takes the value "1" and the others "0", analogous to the seventh frame label of the input. Also, for determining the output of the network, we take the neuron with the maximum value, assign the value of 1 to it, and take the others as zero. The non-linear function of each neuron is a sigmoid function that results in a fine decision at the output of each phone.
Fig. 1: Reference network structure (input: characteristics of 14 frames; output: phone code).
Besides the reference network, a tool necessary for implementing all the proposed methods is a model capable of simultaneously mapping the speech signal characteristics into the phone code and the analogous speaker code. The structure of this model, which is called the primary network in our work, is shown in Fig. 2. The phone label part is precisely similar to that of the reference network. A 72-element code is used for coding the training speakers (71 speakers). In this coding method, 72 neurons are used to indicate the speaker code. For each speaker one neuron takes the value of 1 and all other neurons are zero; the remaining neuron is demonstrative of silence.
To determine the speaker output of the network, we take the neuron with the maximum value among the 72 neurons, assign the value of 1 to it, and assume the rest as zero. The speaker whose talks the primary network is capable of recognizing better than the others shall be chosen as the reference speaker. In all the input adaptation methods mentioned in this paper, the attempt is that the talking of all the speakers be somehow converted to that of the reference speaker.
Fig. 2: Primary network structure (input: characteristics of 14 frames; outputs: phone code and speaker code).
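The one-of-n output coding and the max-value decision rule described above can be sketched as follows. The sizes (35 phones, 72 speaker neurons) are from the paper; the noisy example output and its seed are hypothetical.

```python
import numpy as np

N_PHONES, N_SPEAKERS = 35, 72   # 71 speakers plus one neuron for silence

def one_hot(index, size):
    """Desired output code: one neuron at 1, the rest at 0."""
    code = np.zeros(size)
    code[index] = 1.0
    return code

def decode(outputs):
    """Network decision: the neuron with the maximum value is taken as 1."""
    return int(np.argmax(outputs))

phone_target = one_hot(12, N_PHONES)
speaker_target = one_hot(60, N_SPEAKERS)   # e.g. speaker No. 61 (0-based 60)

# A noisy network output still decodes to the intended speaker.
noisy = speaker_target + np.random.default_rng(1).normal(scale=0.1, size=N_SPEAKERS)
print(decode(noisy))
```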
6. The proposed methods for adaptation of the input to a specific speaker

Speaker adaptation using the phone decision region centers calculation method
We previously said that when the model is trained on certain data, a decision region analogous to each phone from each speaker is formed at the input. The network decision making sometimes encounters problems near the borders of these regions, but in the center of the regions the best decision is made about the phone; in fact, the optimal point analogous to each phone in the input space is situated at the center of the decision region related to that phone.
In this proposed method, the error backpropagation algorithm is used for calculation of these centers. Then the distance vectors between these centers and the analogous phone centers of a certain speaker (the reference speaker) are calculated, and these distance vectors are used to convey the voice patterns from one decision region to the analogous decision region of the reference speaker. Actually, in this method a kind of non-linear normalization is done in which, unlike the usual normalization methods, we have considered the fact that the parameters used for the normalization must be analogous to the present phone. This point originates from the fact that, for example, the distance between the characteristic vectors of one phone for two speakers in the input space is quite different from the distance between the characteristic vectors of another phone for the same two speakers. So, contrary to the usual normalization methods, it is necessary to use separate vectors for compensating the characteristics of different phones.
The implementation procedure of this method is described as follows:
1. Training of the primary network and selection of the reference speaker.
2. By choosing an initial condition for the input, calculation of the decision region center relating to each phone from each speaker is made using the error backpropagation method. The diagram of this step is shown in Fig. 3.
Fig. 3: Calculation of the decision region center of each phone from each speaker (primary network, phone code, speaker code).
3. The distance of each center gained from the center of the decision region of the analogous phone of the reference speaker is calculated, and these vectors are stored as the compensating vectors.
4. For data adaptation with the reference speaker, first it must be determined which phone and which speaker each section of the speech signal is related to. Then the compensating vector gained from the previous step is added to the characteristics related to this section of the speech signal. Here the test data and the training data are adapted to the reference speaker precisely alike each other (Fig. 4).

Fig. 4: Data adaptation in the center calculation method (input, primary network, phone, speaker, compensating vector selection, adapted input).
5. A new network is trained using the adapted training data and is tested using the adapted test data. This network, with a structure the same as the reference network, has been specialized on the adapted data and is expected to recognize the phones with a higher accuracy percentage (Fig. 5).

Fig. 5: Network based on the adapted data (neural network, adapted speech, phone code).
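Steps 3 and 4 above can be sketched numerically. The centers below are random placeholders (the paper computes them by backpropagation), and the dimensions are hypothetical; the sketch only shows that the compensating vector is the offset from each speaker's phone center to the reference speaker's analogous center.

```python
import numpy as np

# Hypothetical decision-region centers, indexed by (phone, speaker):
# centers[p, s] is the optimal input-space point of phone p for speaker s.
rng = np.random.default_rng(2)
n_phones, n_speakers, dim = 4, 3, 5
centers = rng.normal(size=(n_phones, n_speakers, dim))
ref = 0                                     # index of the reference speaker

# Step 3: compensating vector = reference center minus the speaker's own center.
comp = centers[:, ref:ref + 1, :] - centers  # shape (phones, speakers, dim)

def adapt_frame(x, phone, speaker):
    """Step 4: shift a frame toward the analogous region of the reference speaker."""
    return x + comp[phone, speaker]

# A frame lying exactly on its own center lands exactly on the reference center.
x = centers[2, 1]
adapted = adapt_frame(x, phone=2, speaker=1)
print(np.allclose(adapted, centers[2, ref]))
```

Note that a separate vector exists per phone, reflecting the point above that the compensation must be analogous to the present phone.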
Speaker adaptation using the all input frames adaptation method
In the previous method we adapted the input in one step, and we had no control over the phone content and the signal speaker during the input pattern changes on the borders of the phones. Through the following method we try to make the adaptation of the input step by step and momentarily; during adaptation, the phone and speaker knowledge of the signal is controlled. In this method the difference between the code of the speaker recognized by the network and the code of the reference speaker is used for adaptation of the training data.
Consequently, the network input data must move in a direction such that, while maintaining their effect for recognition of the phone at the output, they are uniformed as regards the speaker, and the difference between the network speaker output and the reference speaker code is minimized. For this purpose, the error backpropagation algorithm can be used.
The implementing procedure of this method is as follows:
1. Training of the primary network and selecting the reference speaker.
2. Using the network acquired from step (1) for adaptation of the training and test data with the reference speaker. In order to do so, we take the same phone acquired by the network as the label of the phone, and accept the speaker output label as equal to the reference speaker code. Then we backpropagate the difference between this label and the output gained from the network (the speaker error) and modify the input toward decrease of this error (Fig. 6).
Using this method, the compensation is done in a way that the input patterns, while preserving their phone data, approach the reference speaker as regards the speaker characteristics and adapt to it. The purpose of doing this is the omission of the non-linear effects of the speaker variability component from the input speech patterns, while the phone data is preserved. This will continue until the speaker code error is minimized in relation to the reference speaker code.
3. Similar to step 5 of the previous method, a new network is trained using the adapted training data and is tested using the adapted test data.

Fig. 6: Data adaptation in the all input frames adaptation method (reference speaker code, neural network, phone code).
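The key detail of this method, defining the error only on the speaker neurons so that the phone part of the output is left free, can be sketched as follows. The output layout (35 phone neurons followed by 72 speaker neurons) follows the primary network described above; the example output values are hypothetical.

```python
import numpy as np

N_PHONES, N_SPEAKERS = 35, 72

def speaker_error(output, ref_speaker_code):
    """Error defined only on the speaker neurons of the output."""
    speaker_out = output[N_PHONES:]
    return 0.5 * np.sum((ref_speaker_code - speaker_out) ** 2)

def speaker_error_grad(output, ref_speaker_code):
    """Gradient w.r.t. the full output vector: zero on the phone neurons,
    so only the speaker error is backpropagated toward the input."""
    g = np.zeros_like(output)
    g[N_PHONES:] = output[N_PHONES:] - ref_speaker_code
    return g

out = np.random.default_rng(3).uniform(size=N_PHONES + N_SPEAKERS)
ref = np.zeros(N_SPEAKERS)
ref[60] = 1.0                               # reference speaker code
g = speaker_error_grad(out, ref)
print(bool(np.all(g[:N_PHONES] == 0)))
```

Backpropagating this masked gradient through the frozen network (as in the sketch of part 4) moves each frame toward the reference speaker while the phone neurons impose no direct force on it.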
Speaker adaptation using the inverted network
Another solution for such normalization is that, instead of transferring the patterns to the target decision region by a compensating vector, or tuning the frames one by one in a step-by-step manner, we create a network trained for such normalization; that is, to have a model capable of producing the normalized signal upon knowing the input pattern location in the respective decision region.
What has been stored in the weights of the primary network (the direct network) is a knowledge that, by using the learned similarities in the input patterns, creates the decision regions relating to the different phones from different speakers in the input space. Now the inverted network can relocate the desirable signal using the information existing in the decision regions. Therefore, by suitable modification at the inverted network input, we can displace the decision regions formed in its input space and get the signal analogous to these displaced decision regions.
For example, to convert a speaker's talk signal to the reference speaker's talk, we can supply the phone data of the desirable signal to the inverted network input, together with the reference speaker code. Then the decision regions of the inverted network, after displacement, reside on the analogous regions of the reference speaker; the relocated signal is somehow adapted to the reference speaker, and hence our desired normalization has been achieved.
So, if we can have a network that is inverted with regard to the primary network, it can be used for transferring the patterns to the desired decision regions. The procedures to be implemented in the input adaptation method using the inverted network are as described below:
1. Training of the direct network, so that it is capable of retrieving the phone and speaker data from the input signal. The network employed at this step is completely similar to the primary network described in the previous methods, and the reference speaker is selected in the same manner.
2. Training of the inverted network, so that it is capable of acquiring information from the output (or hidden layer) of the direct network and producing in its output an approximation of the direct network input, that is, the talking signal characteristics analogous to those outputs.
3. Non-linear normalization of the training and test signals, that is, retrieving the phone data from the desired signal through the direct network, and giving it to the inverted network together with the reference speaker data. Therefore, what we have in the inverted network output would be an approximation of the normalized input signal.
4. The final network is trained with the help of the normalized training signal (adapted to the reference speaker) acquired in the previous step, and is tested with the normalized test signal.
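The data flow of steps 1-3 can be sketched as follows. All weights here are untrained random placeholders and all layer sizes except the output codes are hypothetical; the sketch only shows how the phone code from the direct network, with the speaker part replaced by the reference speaker code, is pushed through a medium inverted network and an inverted network to yield a normalized frame (the medium inverted network is introduced in the experiments part below).

```python
import numpy as np

rng = np.random.default_rng(4)
DIM_IN, DIM_HID, N_PHONES, N_SPEAKERS = 18, 64, 35, 72

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Direct network (placeholder weights): frame -> hidden -> phone+speaker codes.
Wd1 = rng.normal(scale=0.1, size=(DIM_HID, DIM_IN))
Wd2 = rng.normal(scale=0.1, size=(N_PHONES + N_SPEAKERS, DIM_HID))

# Inverted network: hidden-layer values -> one input frame (linear output layer,
# so values outside the sigmoid range can be formed).
Wi1 = rng.normal(scale=0.1, size=(DIM_HID, DIM_HID))
Wi2 = rng.normal(scale=0.1, size=(DIM_IN, DIM_HID))

# Medium inverted network: phone+speaker codes -> hidden-layer values.
Wm = rng.normal(scale=0.1, size=(DIM_HID, N_PHONES + N_SPEAKERS))

def direct(x):
    h = sigmoid(Wd1 @ x)
    return h, sigmoid(Wd2 @ h)

def normalize(x, ref_speaker):
    """Step 3: phone code from the direct net + reference speaker code -> frame."""
    _, codes = direct(x)
    codes[N_PHONES:] = 0.0
    codes[N_PHONES + ref_speaker] = 1.0      # substitute the reference speaker
    h_hat = sigmoid(Wm @ codes)              # medium inverted network
    return Wi2 @ sigmoid(Wi1 @ h_hat)        # inverted network, linear output

x = rng.normal(size=DIM_IN)
x_norm = normalize(x, ref_speaker=60)
print(x_norm.shape)
```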
7. Experiments and results analysis

The data used in these experiments are two sentences uttered by 101 persons. These data contain clean Farsi speech, recorded in a silent room, and are almost equivalent to the TIMIT data for the English language. The sentences of these data include about all Farsi phones and contain 48 different syllables. The voice data of the first 71 persons have been used as training data and the voice data of the next 30 persons for testing.
During the experiments, the LHCB parameters (the energy logarithm of the Hanning square filter bank with critical bands), having shown high efficiency in previous works, have been used as the exploited characteristics. Each frame includes 18 parameters representing the energy content of the Hanning windows that have been placed over the spectrum of the desired signal with critical bandwidths. For normalization of the LHCB parameters, the fine normalization method has been used.
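A toy sketch of log filter-bank energies in the spirit of the LHCB parameters follows. This is not the paper's feature extractor: the band edges below are equal-width placeholders rather than the critical bands, and the frame length and sample rate are assumed; only the shape of the computation (Hanning window, power spectrum, per-band log energy, 18 parameters) follows the description above.

```python
import numpy as np

def lhcb_like(frame, n_bands=18):
    """Toy log filter-bank energies; band edges are NOT the paper's
    critical bands, just equal-width placeholders."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    edges = np.linspace(0, len(power), n_bands + 1, dtype=int)
    energies = [power[a:b].sum() for a, b in zip(edges[:-1], edges[1:])]
    return np.log(np.asarray(energies) + 1e-10)   # energy logarithm

# Assumed 25 ms frame at 16 kHz containing a 440 Hz tone.
frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
feats = lhcb_like(frame)
print(feats.shape)
```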
After data preparation, the network training is started. The training algorithm of this network is the error backpropagation algorithm, with a learning coefficient of 0.1, a momentum coefficient of 0.7, and random selection of samples. Randomization in learning prevents the phone and speaker orders peculiar to the training data from having any effect on the network training. An important point is that after convergence and termination of the training, should the network still have the capacity of learning, it will notice the specific details contained in the training data, which is not desirable to us and will cause a decline of the recognition accuracy on the test data. Hence, during the training process, after the completion of every several training courses, the test data undergoes a test step, and finally the weights having the highest percentage of accuracy are selected as the network weights.
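This best-weights selection amounts to early stopping with a weight snapshot, which can be sketched as follows. The model, training step, and accuracy history below are toy placeholders, not the paper's network; the sketch only shows keeping the weights from the course with the highest test accuracy.

```python
import copy

def train_with_best_weights(model, train_step, evaluate, n_courses):
    """After each training course, test on held-out data; keep the best weights."""
    best_acc, best_weights = -1.0, None
    for _ in range(n_courses):
        train_step(model)
        acc = evaluate(model)
        if acc > best_acc:
            best_acc, best_weights = acc, copy.deepcopy(model["w"])
    model["w"] = best_weights      # final weights = highest test accuracy seen
    return best_acc

# Toy model whose test accuracy rises and then falls (overfitting-like shape).
history = iter([0.70, 0.78, 0.81, 0.79, 0.75])
model = {"w": 0, "t": 0}
def train_step(m):
    m["t"] += 1
    m["w"] = m["t"]                # "weights" after course t
def evaluate(m):
    return next(history)

best = train_with_best_weights(model, train_step, evaluate, n_courses=5)
print(best, model["w"])
```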
This network achieves a phone recognition accuracy of 81.9%, which we have tried to improve through implementing speaker adaptation. For the purpose of training the primary network it is necessary to make some arrangements. A kind of trade-off is observed in the training process of this network: because of the type of sample selection, the network learns the phones at first, and then proceeds with learning of the speakers and forgets the phones. Therefore, after achieving the highest percentage of accuracy on phone recognition (relating to the test data), all the network weights remain constant and the training process continues with correction of the W2 weights only. Using this solution, while preserving the network knowledge about the phones, its knowledge about the speakers increases.
Adding a hidden layer in the course of the speaker recognition training has been done because of the two weight matrices not having enough capacity to learn the speaker information. Observing the output of the neurons relating to the speaker code, it can be seen that in case of regulating the weights of W2 with a learning coefficient of 0.1, all the outputs are about zero, and the network is not able to bring the outputs to between 0 and 1. So, to reduce the size of the learning step, we used a learning coefficient of 0.01.
In this way, a speaker recognition accuracy of 75% on the training data and a phone recognition accuracy of 81.4% on the test data are achieved at the frame level. Reviewing the percentage of phone recognition accuracy on the training speakers, speaker No. 61 shows the highest phone recognition percentage (89.6%). This speaker is selected as the reference speaker.
Implementing the decision region centers calculation method
During the implementation of this method, the calculation of the decision region center of each phone from each speaker was made using the error backpropagation of the output. Then the distance of each center from the center of the analogous phone of the reference speaker was calculated and stored as the compensating vector of this phone from this speaker. During the final network training, first, by giving each section of the signal to the primary network, the phone code and its analogous speaker code were calculated; then, with the help of the compensating vector relating to the achieved phone and speaker code, this part of the input signal was transferred to the analogous decision region of the reference speaker. A network with a structure similar to the reference network was trained using this transferred data and was tested using the test data, which had been adapted to the reference speaker precisely using the same method.
An issue we encounter in practice, in all of the proposed methods, is that though the implementation of these methods corrects many of the reference network faults, in some instances, notwithstanding the true recognition of the primary network, the final network goes wrong. A major reason for this is that in both algorithms used for tuning the input, the phone recognition of the primary network is somehow used. So, in implementing the methods, the error of the primary network recognition hinders the correction of the input. This problem intensifies during the adaptation of the test data, where the phone labels are not available.
The implementation of these methods therefore needs a criterion through which we are able to determine which instances have been corrected by the new network (trained with the adapted data). To achieve this criterion, we compared the outputs of the primary and final networks, and after investigating these outputs it was noticed that in case the phone recognition result of the primary network is different from that of the final network for a specific part of the signal, the final network view is only acceptable when the confidence level of the phone utterance in the final network is higher than that of the primary network.
As was mentioned, to determine the code of the phone recognized by the network, among the values gained at the output neurons of the phone, which reside inside the interval (0, 1), we take the highest value as 1 and the others as zero. By the confidence level we mean this maximum value, and we use it as the criterion for determining the final result. Through implementing this method, which is called the combination of the primary and adapted networks results, many of the mentioned instances are corrected.
Hence we have the results indicated in Table 1, that is, a 2.1% improvement compared to the reference network.

Table 1: Results gained from the implementation of the first method of input adaptation.
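The combination rule described above can be sketched as follows. The example output vectors are hypothetical; the rule itself (on disagreement, accept the final network only when its maximum output, i.e. its confidence level, is the higher one) follows the text.

```python
import numpy as np

def combine(primary_out, final_out):
    """Combine the two networks' phone decisions: when they disagree, accept
    the final network's phone only if its confidence (max output) is higher."""
    p = int(np.argmax(primary_out))
    f = int(np.argmax(final_out))
    if p == f:
        return p
    return f if final_out[f] > primary_out[p] else p

primary = np.array([0.1, 0.9, 0.2])   # primary net: phone 1, confidence 0.9
final   = np.array([0.6, 0.3, 0.1])   # final net:   phone 0, confidence 0.6
print(combine(primary, final))
```

Here the primary network's more confident decision survives, which is how instances wrongly altered by the adaptation are recovered.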
Implementing the all input frames adaptation method
As was described before, when using this method, after training of the primary network and selecting the reference speaker, the training and test data must be adapted to the reference speaker using the error backpropagation. The desirable output here is the phone code recognized by the network together with the reference speaker code (the maximum value among the output values is taken as 1 and the rest are assumed as zero); so the desirable output of the phone can change. Similar to the training step, all the frames are selected randomly, but the error function is only defined on the neurons related to the speaker. This way we tell the network to transfer the input pattern to the decision region of this same phone it has recognized, belonging to the reference speaker.
The error decrease diagram and the accuracy percentage of the phone recognition during regulation of the input are shown in Fig. 7. As is seen from the figure, although, because of the random selection of the samples, the accuracy percentage of the phone recognition during the input correction is highly fluctuating, the mean of these fluctuations shows no considerable change. This behavior of the modification network was expected: in fact, we granted the network the freedom to change the inputs, but these changes are effected only when the backpropagated error from the speaker route permits such changes.
Fig. 7: A) Error decrease curve of the speaker error per number of repetitions during tuning of the input. B) Accuracy percentage of phone recognition on the training data per number of repetitions during tuning of the input.
After training of the final network with the training data acquired from this step, and testing it on the regulated test data, the results indicated in Table 2 are gained, a 2.6% improvement compared with the reference network.

Table 2: The results acquired from implementing the second method of input adaptation.
Implementing the inverted network method
The direct network in this method is quite similar to the primary network. A solution seeming at first suitable for training of the inverted network is to take as the inverted network input the values of the 35 neurons of the phone output and the 72 neurons of the speaker output of the direct network, and take the input signal characteristics as the desirable output of the inverted network. But experiment shows that this procedure results in divergence of the inverted network, because of the abundant discrepancies in the mapping. So we take the inverted network input equal to the values of the last hidden layer of the direct network, which has more information compared with the last layer, for approximation of the signal by the inverted network.
Also, for the purpose of reducing the error as far as possible, we approximate only one frame of the input signal at the output of the inverted network, and get all the frames by repeating this over the signal. Also, a linear layer is assigned for forming the values lying outside the limits of the sigmoid function output. So the inverted network structure is gained as shown in Fig. 8.
Fig. 8: Structure of the direct and inverted networks (labels: input signal, direct network, inverted network, signal approximation from the inverted network, speaker)
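The inverted network's forward pass can be sketched as below. The layer sizes are assumptions for illustration (256 hidden-layer values in, one 39-dimensional frame out) and the weights are random stand-ins; the point is the structure, in particular the final linear layer that lets outputs leave the sigmoid's (0, 1) range.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.1, size=(128, 256))  # hidden layer of the inverted network
W2 = rng.normal(scale=0.1, size=(39, 128))   # final linear layer (no squashing)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inverted_net(hidden_values):
    """Map the direct network's last-hidden-layer values to one signal frame."""
    h = sigmoid(W1 @ hidden_values)
    return W2 @ h            # linear output: frame values may fall outside (0, 1)

frame = inverted_net(rng.normal(size=256))   # one 39-dimensional frame
```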
Fig. 9 shows the error-reduction curve of this network during training. As is clear from this figure, the training error does not fall below about 0.3. The reason is the one-to-many mapping this network has to learn: at best, the network can produce a nonlinear median of the possible signals corresponding to a given input.
Fig. 9: Error-reduction curve of the inverted network during training
As mentioned in the description of the proposed method's algorithm, normalization requires that the normalized values be produced by the inverted network from the phone information of the given signal and the information of the reference speaker. But because the inverted network has been trained on the values of the hidden layer, another inverted network is needed to map the phone and speaker information to the hidden-layer values. The structure of this network, which we call the medium inverted network, is shown in the following figure.
Figure: Structure of the direct and medium inverted networks (labels: direct network, approximation of the hidden layer)
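One way to assemble training pairs for such a medium inverted network can be sketched as follows. The direct network here is a hypothetical single-hidden-layer stand-in with the paper's 35 phone outputs and 72 speaker outputs and an assumed 256-unit hidden layer; its forward pass supplies both the 107-value input and the hidden-layer target.

```python
import numpy as np

rng = np.random.default_rng(2)
Wh = rng.normal(scale=0.1, size=(256, 39))   # input -> hidden (assumed 256 units)
Wp = rng.normal(scale=0.1, size=(35, 256))   # hidden -> 35 phone outputs
Ws = rng.normal(scale=0.1, size=(72, 256))   # hidden -> 72 speaker outputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def direct_net(frame):
    """Stand-in direct network: returns phone code, speaker code, hidden values."""
    hidden = sigmoid(Wh @ frame)
    return sigmoid(Wp @ hidden), sigmoid(Ws @ hidden), hidden

def medium_training_pair(frame):
    phone, speaker, hidden = direct_net(frame)
    x = np.concatenate([phone, speaker])   # 35 + 72 = 107 input values
    return x, hidden                       # target: the hidden-layer values

x, y = medium_training_pair(rng.normal(size=39))
```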
After training these two networks we were able to obtain an approximation of the input signal at the inverted network output. To evaluate this approximation, we feed the approximated signal to the direct network input and achieve a phone recognition rate of 81.4% at its output. In addition, a new network with a structure similar to the reference network was trained and tested using this approximated signal, achieving a phone recognition accuracy of 81.8%. Comparing these results with the phone recognition accuracy of the direct network (81.4%) and the reference network (81.9%) indicates that the approximated signal is sufficiently accurate, both in its similarity to the original signal and from the phone-content point of view.
In the next step the input signal is given to the direct network, and the phone and speaker information is obtained from the direct network output. To carry out the adaptation, we discard the obtained speaker information and give the phone data, together with the reference speaker code, to the input of the medium inverted network. The medium inverted network approximates the values of the hidden layer so that they contain the phone data of the signal and the reference speaker information. The inverted network, using these approximate values, regenerates the signal adapted to the reference speaker.
Figure: Data adaptation in the inverted network method (labels: speaker code, phone information, medium inverted network, approximation of the hidden-layer values, inverted network, speaker-adapted signal)
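The adaptation step just described can be sketched end to end. Every network below is a random stand-in with assumed layer sizes (39-dimensional frames, 35 phone outputs, 72 speaker codes, 256 hidden values), so only the data flow is illustrated, not the learned mapping.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
Wh = rng.normal(scale=0.1, size=(256, 39))    # direct net: input -> hidden
Wp = rng.normal(scale=0.1, size=(35, 256))    # direct net: hidden -> phone code
Wm = rng.normal(scale=0.1, size=(256, 107))   # medium inverted network
Wi = rng.normal(scale=0.1, size=(39, 256))    # inverted network (linear output)

def adapt_frame(frame, ref_speaker_code):
    hidden = sigmoid(Wh @ frame)
    phone = sigmoid(Wp @ hidden)               # keep only the phone information
    # the speaker information from the direct network is discarded here
    hidden_hat = sigmoid(Wm @ np.concatenate([phone, ref_speaker_code]))
    return Wi @ hidden_hat                     # regenerated, speaker-adapted frame

ref = np.zeros(72)
ref[0] = 1.0                                   # reference speaker code
adapted = adapt_frame(rng.normal(size=39), ref)
```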
The final network, having a structure similar to the reference network, is trained using the adapted training data and tested using the adapted test data. The results shown in Table 3 are thus obtained, an improvement of about 3% compared with the reference network.
Table 3: The results from implementing the third input adaptation method
8. Discussion and conclusion
In the present study we tried to reduce the effects of speaker variability using nonlinear normalization of the signal. In the methods employed, all the signals in the test and training database were adapted to a reference speaker, and in this way the speaker-independent system was approximately converted into a speaker-dependent one.
The efficiency of this method is highly dependent on the network's ability to learn the speaker characteristics. It seems that adding more hidden layers, in a way that remains time efficient, together with clustering of the speakers, is a way to improve speaker recognition. For clustering of the speakers the data must be sufficiently rich; in that case it would be possible to adapt the clusters to a reference cluster, instead of adapting individual speakers.
For normalization, two general ideas were investigated. The first was employing the error backpropagation algorithm for adaptation of the input, implemented through two methods: coarse and soft compensation. In coarse adaptation of the data, the compensating vectors, that is, the distances of the decision-region centers of the different speakers from the analogous decision regions of the reference speaker, were used. This method resulted in a 2.2% improvement of the phone recognition.
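The coarse compensation can be sketched as below, with illustrative two-dimensional points standing in for the learned decision-region centers; the offset for each (speaker, phone) pair points from that speaker's center toward the reference speaker's analogous center.

```python
import numpy as np

def compensating_vectors(centers, ref_speaker):
    """centers maps (speaker, phone) -> decision-region center; the offset for
    each pair points from that speaker's center to the reference speaker's."""
    return {(spk, ph): centers[(ref_speaker, ph)] - c
            for (spk, ph), c in centers.items()}

def coarse_adapt(frame, speaker, phone, offsets):
    return frame + offsets[(speaker, phone)]

# Illustrative 2-D centers for one phone of speaker "A" and the reference
centers = {("A", "aa"): np.array([1.0, 2.0]),
           ("ref", "aa"): np.array([1.5, 1.0])}
offsets = compensating_vectors(centers, "ref")
adapted = coarse_adapt(np.array([0.0, 0.0]), "A", "aa", offsets)
# adapted == [0.5, -1.0]: the frame is shifted toward the reference region
```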
In soft adaptation of the data, all the input frames were regulated using the error backpropagation algorithm so that they produced the reference speaker code at the output. Implementation of this method resulted in a 2.6% improvement in phone recognition.
The second idea was employing the inverted network for nonlinear normalization of the input signal. In this method, the phone and speaker data are extracted from the input signal by the direct network, and the phone data are supplied to the inverted network together with the reference speaker data. Using this information, the inverted network gives an approximation of the input signal adapted to the reference speaker. Implementation of this normalization results in a 3% improvement of the phone recognition. Compared with input modification by error backpropagation, this method produces better results as well as greater computational stability.
Surveying the results of the above methods shows that vowels and semivowels are corrected better than all other phonic classes. This can be attributed to the fact that, according to the phonetics literature, vowels and semivowels carry more speaker information than other phones, so compensating for speaker variability has more influence on them.
Although using the confidence-level criterion for combining the results corrects many problematic instances that arise during implementation of the methods, there remain instances that are not corrected by this criterion. By investigating the results of the final and primary networks on the different phonic classes, it should be possible to choose a criterion that is applied in proportion to the phone type, yet produces better results.
The fulfillment of the present study has been achieved with the support and cooperation of the Intelligent Processing of the Signs Institute.