
Non-linear Normalization of the Speech Signal in Relation to the Speaker Variability in Neural Networks of Speech Recognition


Iesar Gholinejad¹, Seid Ali Salehi²

1 - MA, Bioelectric
2 - Assistant Professor, Medical Engineering Faculty of Amir Kabir Industrial University



Abstract

The issue of input variability resulting from speaker changes is one of the most crucial factors influencing the effectiveness of speech recognition systems. One solution to this problem is adaptation or normalization of the input, such that all input representation parameters are adapted to a single speaker and a kind of normalization is performed on the input pattern with respect to speaker changes before recognition is carried out. This paper proposes three methods for this purpose, each compensating for a portion of the influence of speaker changes on speech recognition neural networks. In all three methods, a network is first trained to map the input onto codes representing the phone classes and the speakers. Then, among the 71 training speakers, the one with the highest phone recognition accuracy is selected as the reference speaker, and the parameter representations of the other speakers are converted to the analogous speech produced by this speaker. In the first method, the error backpropagation algorithm is used to calculate the optimal point of the decision region for each phone of each speaker, for all phones and all speakers. The distances of these points from the analogous points of the reference speaker are employed to offset the effect of speaker changes and to adapt the input signal to the reference speaker. In the second method, using the error backpropagation algorithm and holding the reference speaker code as the desired speaker output, we modify all speech signal frames of the training and test sets so that they coincide with the analogous speech of the reference speaker. In the third method, an inverted network is additionally trained to map the phone and speaker data back to the input representation. The phone data retrieved from the direct network is then given to the inverted network together with the reference speaker data; using these, the inverted network produces an estimate of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained with the adapted training data and tested with the adapted test data. By implementing these three methods and combining the final network results with the unadapted network based on the highest confidence level, increases of 2.1%, 2.6% and 3% in phone recognition correctness on clean speech are gained, respectively.



Key words: Speaker variability, Input adaptation, Non-linear normalization, Inverting neural networks.





1. Introduction


The issue of variability exists in all studies related to pattern recognition. Variability is attributed to factors that influence the patterns and make their recognition difficult. The following are examples of such variabilities in speech recognition: change of the speaker, change of the accent and tone of the voice, change of the voice pitch, change of the distance and direction of the speaker, change of environment effects such as walls, change of the direct transmission channel, change of the PA system, change of the voice velocity, and so on.


Along with the developments achieved in speech recognition research, the importance of the influence of speaker changes has become increasingly clear. The efficiency of Speaker Dependent systems is higher than that of Speaker Independent systems, and speaker variability analysis tries to minimize this gap. In the present study we aim to compensate for the variability resulting from speaker changes in automatic recognition of speech signals. Some recent studies show that the robust performance of the human perception system in speech recognition, image recognition, etc. is related to the special mode of signal processing performed in our brain. Some characteristics of this special processing, attributed to the brain cortex, are as follows:

1) Two-way signal processing in neural networks, with feedforward and inverted or reverse connections. 2) The ability of non-linear processing of signals, and analysis of the linear and non-linear components existing in the input signal. 3) The ability to remove the effect of special conditions.

The method employed in this paper for compensating the input variability is regulating the input so that all signals are adapted to a single speaker, hence carrying out a kind of non-linear normalization on the signal with respect to the speaker. Consequently, the signals are equalized with respect to their data content, and the effect of speaker variability is eliminated. To simulate the above-mentioned regression connections, we propose and implement two propositions in this paper. In the first proposition, the error backpropagation method is used for input adaptation: at first a network is trained, and then, by fixing the learned weights, backpropagating the present output error with respect to the desired output, and correcting the input, the input is directed toward adaptation with the desired output. In the second method, we suggest training an inverted network beside the direct one, and employing this inverted network for approximation of the direct network input while maintaining our desired changes.


This paper includes the following parts: In the second part, the issue of speaker variability is described in more detail. In the third part, the compensation methods for speaker variability employed so far are reviewed. The theoretical basis of the input adaptation idea built on the error backpropagation method is introduced in the fourth part. The fifth part describes the structure of the models employed in the present study. The sixth part describes the algorithm of the proposed methods. The experiments and the analysis of the results are presented in the seventh part. The final part of this paper is allocated to discussion and conclusion of the study.


2. The speaker variability issue

When we train a model with speech signals related to a particular speaker, this model forms the decision regions related to that speaker, and with the aid of these regions decides about the speech signal of a new speaker. Admittedly, the correctness of phone recognition on the test data is substantially lower than on the training data. The reason for this is the changes of the new signal with respect to the training signal. In fact, the relationship between x1 from the feature vector of the training signal and x2 from the feature vector of the test signal can be written as follows:

(1)    x2 = x1 + V(x1)
where the function V(X) describes the effects of variabilities such as speaker changes, microphone, voice pitch, accent, etc. A part of these variations is related to the differences among the speech characteristics of the speakers, and we are trying to compensate for this part in our work.

As V(X) is a complicated non-linear, time-varying function, it must be calculated momentarily and numerically. For example, to compensate for the speaker variability, the V(X) function must be calculated separately for every frame of every speaker's speech. So it would be necessary to use a flexible and adaptive model for this purpose.


Neural networks can be a good choice for this. The most important advantages of these networks are: 1) Implementing neural networks usually requires few assumptions, and they make full use of huge databases. 2) The error backpropagation method can be generalized to any optimization criterion, such as Maximum Likelihood, and to most other training methods. 3) Neural networks can easily be accommodated as modules in bigger structures [3].

3. A review of previous works

In the variety of methods used for compensating the effect of speaker changes, the expression V(X) shown in relationship (1), or a function of it, is calculated. Then, either V(X) is omitted so that this variability is removed from the desired patterns and the reference pattern is achieved, or the recognition is carried out with the help of V(X), implementing it beside the desired pattern. Three methods have been used so far to compensate a part of this gap:

1) Adaptation of the characteristics (normalization): converting the training and test data characteristics so that the speaker differences are omitted from them and they look nearly alike. This conversion must be applied to the training patterns before training the model. Admittedly, a model trained with these normalized data will be close to an SD system, and this method will offset part of the speaker variability. The normalization process must also be carried out on all of the test patterns when testing the model. If the conversion is not applied to one of the test patterns, the speech under estimation would appear unnatural to the normalized system. However, because V(X) is a non-linear, time-varying function, its calculation is quite difficult and it is not easily omitted from the signal. So in the normalization process, only some of the minor effects of speaker changes can be removed. This is because most of the normalization methods employed in previous papers and works are linear. These normalizations are usually based on statistical models and digital signal processing, which are normally accompanied by many approximations, and the capabilities of neural networks have been neglected.

2) Adaptation of the speaker: regulating the speaker-independent model so that it is in proportion to the new voice. The model adaptation algorithms differ with respect to the types of parameters regulated, the criterion employed, and the limitations imposed on the conversions. The linear transformation family (e.g. MLLR [4]), the Bayesian training family (e.g. MAP [5]), and the speaker space family (e.g. eigenvoices [6]) are methods used to solve this problem. In some other works, the attempt has been to use clustering for compensating the speaker changes [7]. In other methods, numerous models such as Sex Dependent, Accent Dependent, etc. are made, and then a model selection method is employed for adaptation [8]. But the complexity of speech recognition models and the need for huge data assemblies make the model adaptation methods very complicated. Too many free parameters in these models make their analysis difficult. Model adaptation has also been effected using statistical methods.

3) Another approach to the speaker change problem is that, instead of normalization and discarding of the speaker data, or adaptation, the speaker data are used for recognition [9].



4. Employing the error backpropagation algorithm for input adaptation

In two of the three methods proposed in this paper, the error backpropagation algorithm is used for adaptation of the input. So, at first it is necessary to express this idea accurately.

Assume that the network training database and the output labels analogous to it are at hand. In that case, the goal of network training is to find the network weights so that they are capable of mapping the input data onto the desired output (the analogous labels) in a perfect manner. In the error backpropagation method, a definite cost function is assumed for network training; normally a suitable option for this function is the sum of squares of the differences between the desired output and the actual output of the network. So, the cost function is written as follows:

(2)    C = (1/2) Σ_{n=1}^{S_O} (y_n − ŷ_n)²

where y_n is the n-th output of the network, ŷ_n its approximation, and S_O is the number of network output neurons.

The coefficient 1/2 has only been included for simplification of the calculations. If the network has only one hidden layer, and S_I and S_H represent the number of input and hidden layer neurons, the value of the n-th neuron of the output layer will be equal to:




(3)    u_j = f( Σ_{k=1}^{S_I} w_jk x_k )

(4)    ŷ_n = f( Σ_{j=1}^{S_H} v_nj u_j ) = f(a_n)

where w and v are the first- and second-layer weight matrices respectively, and f is the non-linear activation function of the network's neurons.
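The forward pass in equations (3) and (4) can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual model: the layer sizes, weights and input are random stand-ins, and the sigmoid is used for f (the choice the paper itself makes for its networks).

```python
import numpy as np

def f(a):
    """Sigmoid activation, the non-linear function f of the neurons."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w, v):
    u = f(w @ x)          # hidden-layer values, eq. (3)
    y_hat = f(v @ u)      # network outputs ŷ, eq. (4)
    return u, y_hat

rng = np.random.default_rng(0)
S_I, S_H, S_O = 4, 3, 2                 # illustrative input/hidden/output sizes
w = rng.standard_normal((S_H, S_I))     # first-layer weight matrix
v = rng.standard_normal((S_O, S_H))     # second-layer weight matrix
x = rng.standard_normal(S_I)            # one input frame

u, y_hat = forward(x, w, v)
```

With a sigmoid f, every output lies strictly between 0 and 1, which is what allows the "confidence level" interpretation used later in the paper.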


The error backpropagation algorithm is a conventional method for training feedforward neural networks. Consider a network that has been trained by the above method for mapping the input to the output labels, and whose output error curve has converged to a stable minimum. Now assume that we want to change the inputs of this network, so that all inputs are mapped onto one new output label with the same trained weights. In that case, the inputs must move in a direction that minimizes the difference between the present output of the network and the new label. For this purpose, the error backpropagation algorithm can be used. Here again the error function is the same as in relationship (2); however, it must be noted that this error function must be defined on that part of the output that has got the new label. At this stage, the direction and magnitude of the movement of the input data in their multi-dimensional space must be calculated so as to minimize the new error:

(5)    x_new = x + Δx
To determine the value of Δx, the basis of the error backpropagation algorithm is used, except that here the network's trained weights are constant and only the input changes. In this case we have:

(6)    Δx = −η ∂C/∂x
where C is the same as in relationship (2), but the label y_n may be different from the y_n used during training. Accordingly:

(7)    ∂C/∂x_k = Σ_{n=1}^{S_O} (ŷ_n − y_n) ∂ŷ_n/∂x_k
Following the calculations done for deriving the error backpropagation relationships:

(8)    ∂ŷ_n/∂x_k = f′(a_n) ∂a_n/∂x_k

(9)    ∂a_n/∂x_k = Σ_{j=1}^{S_H} v_nj ∂u_j/∂x_k

(10)   ∂u_j/∂x_k = f′(a_j) w_jk

(11)   ∂C/∂x_k = Σ_{n=1}^{S_O} (ŷ_n − y_n) f′(a_n) Σ_{j=1}^{S_H} v_nj f′(a_j) w_jk

By adding this value to x to obtain the new x and recalculating the output, the value of the cost function C decreases, and this continues until the function is minimized. Since the Δx value in this method is calculated momentarily and step by step, the modification of x is carried out dynamically, and finally it gives us approximate convergence of the input to the one analogous to the new output; thereby we can convert the present input into the analogous input of the reference speaker. In the remainder of this paper, the algorithms of the two proposed methods which use error backpropagation for regulating the input will be discussed.
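The input adaptation idea of equations (5)-(11) can be sketched in NumPy: the trained weights w and v stay frozen, and only the input x descends the gradient of the cost toward a new target label. The network sizes, weights, target label, learning rate η and iteration count below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def f(a):
    return 1.0 / (1.0 + np.exp(-a))

def f_prime(a):
    s = f(a)
    return s * (1.0 - s)

def cost_and_grad_wrt_input(x, y_new, w, v):
    a_hidden = w @ x
    u = f(a_hidden)                          # eq. (3)
    a_out = v @ u
    y_hat = f(a_out)                         # eq. (4)
    C = 0.5 * np.sum((y_new - y_hat) ** 2)   # eq. (2), with the new label
    # eq. (11): dC/dx_k = sum_n (ŷ_n - y_n) f'(a_n) sum_j v_nj f'(a_j) w_jk
    delta_out = (y_hat - y_new) * f_prime(a_out)
    grad_x = w.T @ (f_prime(a_hidden) * (v.T @ delta_out))
    return C, grad_x

rng = np.random.default_rng(1)
w = rng.standard_normal((5, 4))              # frozen first-layer weights
v = rng.standard_normal((3, 5))              # frozen second-layer weights
x = rng.standard_normal(4)                   # the input to be adapted
y_new = np.array([1.0, 0.0, 0.0])            # new target label

eta = 0.1                                    # illustrative learning rate
C0, _ = cost_and_grad_wrt_input(x, y_new, w, v)
for _ in range(500):
    _, g = cost_and_grad_wrt_input(x, y_new, w, v)
    x = x - eta * g                          # eqs. (5)-(6): move the input, not the weights
C_final, _ = cost_and_grad_wrt_input(x, y_new, w, v)
```

Running the loop drives the cost down, i.e. the input converges toward a point that the fixed network maps onto the new label.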



5. The structure of the employed models

The structure of the reference network and its training procedure

In the present work, first a neural network is trained as the reference network for efficiency estimation of the presented methods, and then the results gained will be compared with this network's results. The reference network employed here is a time delay neural network (TDNN), the structure of which is shown in Fig. 1 [10].

Each time, 14 consecutive frames are implemented into the network input all at once. This network has two hidden layers: the first layer has no time delay, and the second one has time delay. The output layer has 35 neurons, each representing one of the Farsi speech phones. For every phone we have a 35-bit code as the desired output, one bit of which is "1" and the others "0". This code is analogous to the label of the seventh frame of the input. Also, for determining the output of the network, we take the neuron with the maximum value among these neurons, assign it the value 1, and take the others as zero. The non-linear function of each neuron is a sigmoid function, which results in a fine decision at the output for each phone.


Fig. 1 - Reference network structure (input: characteristics of 14 input frames; output: phone code)
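The output decision rule described above (take the maximum-valued neuron as 1, all others as zero) can be sketched in a few lines; the 3-neuron activation vector is purely illustrative.

```python
import numpy as np

def harden_output(activations):
    """Set the neuron with the maximum value to 1 and all others to zero."""
    code = np.zeros_like(activations)
    code[np.argmax(activations)] = 1.0
    return code

activations = np.array([0.1, 0.7, 0.3])   # illustrative output activations
code = harden_output(activations)          # one-hot phone code
```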


Besides the reference network, a tool necessary for implementing all of the proposed methods is a model capable of simultaneously mapping the speech signal characteristics onto the phone code and the analogous speaker code. The structure of this model, called the Primary Network in our work, is shown in Fig. 2. The phone label is precisely the same as in the reference network, and a 72-element code is used for coding the training speakers (71 speakers). In this coding method, 72 neurons are used to indicate the speaker code: for each speaker one neuron takes the value of 1 and all other neurons are assigned zero. The 72nd neuron represents silence. To determine the network's recognition of the speaker, we take the neuron with the maximum value among these 72 neurons, assign it the value 1, and set the rest to zero. The speaker whose talk the primary network is capable of recognizing better than the others is chosen as the reference speaker. In all of the input adaptation methods mentioned in this paper, the aim is that the talk of all speakers be somehow converted to the reference speaker's talk.


Fig. 2 - Primary network structure (input: characteristics of 14 input frames; outputs: phone code and speaker code)


6. The proposed methods for adaptation of the input to a specific speaker

Adaptation of the speaker by the method of calculating the phone decision region centers

We previously said that when the model is trained on certain data, a decision region analogous to each phone from each speaker is formed at the input. These regions interfere at their borders, and the network's decision making sometimes encounters problems in the overlapping sections of the borders. But in the centers of the regions the best decision is made about the phone; in fact, the optimal point analogous to each phone is situated at the center of the decision region related to that phone in the input space.


In this proposed method, the error backpropagation algorithm is used for calculation of these centers. Then the distances between these centers and the analogous phone centers of a certain speaker (the reference speaker) are calculated, and these distances are used to convey the voice patterns from one decision region to the analogous decision region of the reference speaker. Actually, in this method a kind of non-linear normalization is done in which, contrary to the usual normalization methods, we have considered the fact that the parameter used for the normalization must be analogous to the present phone [11]. This point originates from the fact that, for example, the distance between the characteristic curves of the phone «آ» for two speakers in the input space is quite different from the distance between the characteristic curves of the phone «ف» for the same two speakers. So, contrary to the usual normalization methods, it is necessary to use separate curves for compensating the characteristics of different phones. The implementation procedure of this method is as follows:

1. Training of the primary network and selection of the reference speaker.

2. By choosing an initial condition for the input, the center of the decision region relating to each phone from each speaker is calculated using the error backpropagation method. The block diagram of this step is shown in Fig. 3.





Fig. 3 - Calculation of the decision region center of the i-th phone from the j-th speaker (primary network; i-th phone code, j-th speaker code)

3. The distance of each gained center from the center of the decision region of the analogous phone of the reference speaker is calculated, and these vectors are stored as the compensating vectors.

4. For data adaptation to the reference speaker, first it must be determined to which phone and which speaker each section of the speech signal is related. Then the compensating vector gained in the previous step is added to the characteristics of this section of the speech signal. Here the test data and training data are adapted to the reference speaker in precisely the same way (Fig. 4).




Input, primary network, phone, speaker, compensating vector selection, adapted input

4. Data adaptation in center calculation procedure


5. A new network is trained using the adapted training data and is tested using the adapted test data. This network, with a structure the same as the reference network's, has been specialized on the adapted data and is expected to recognize the phones with a higher accuracy percentage (Fig. 5).



Fig. 5 - Network based on the adapted data (neural network; input: adapted speech; output: phone code)

Adaptation of the speaker by the method of adapting all input frames

In the previous method we adapted the input in one step, using a coarse decision. Also, we had no control over the phone content and the signal's speaker during the changes of the input pattern at the borders of the phones. In the following method we try to adapt the input in a step-by-step and momentary manner. During the input adaptation, the phone and speaker knowledge of the signal is controlled. In this method, the difference between the code of the speaker recognized by the network and the code of the reference speaker is used for adaptation of the training data. Consequently, the network inputs must move in a direction such that, while maintaining their effect on recognition of the phone at the output, they are made uniform with respect to the speaker, and the difference between the network's speaker recognition and the reference speaker code is minimized. For this purpose, the error backpropagation algorithm can be used. The implementation procedure of this method is as follows:

1. Training of the primary network and selecting the reference speaker

2. Using the network acquired in step (1) for adaptation of the training and test data to the reference speaker. To do this, we take the same phone acquired by the network as the label of the phone output, and set the speaker output label equal to the reference speaker code. Then we backpropagate the difference between this label and the output gained from the network (the speaker error) and modify the input toward decreasing this error (Fig. 6). Using this method, the compensation is done in such a way that the input patterns, while maintaining their phone data, approach the reference speaker with respect to the speaker characteristics and adapt to it. The purpose of this is the omission of the non-linear effects of the speaker variability component from the input speech patterns, so that the phone data is preserved. This continues until the speaker code error with respect to the reference speaker is minimized.

3. Similar to step 5 of the previous method, a new network is trained using the adapted training data and is tested using the adapted test data.



Fig. 6 - Adaptation of data in the all-input-frames adaptation method (reference speaker code, neural network, phone code)
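Step 2 above can be sketched as gradient descent of the speaker error on a single frame: the speaker output is pushed toward the reference speaker's one-hot code while the trained weights stay fixed. A tiny linear-plus-sigmoid "speaker head" stands in for the primary network here; the sizes, weights, learning rate and reference index are all illustrative assumptions.

```python
import numpy as np

def f(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(3)
n_speakers, dim = 4, 6
W = rng.standard_normal((n_speakers, dim))   # frozen stand-in speaker-head weights
ref_code = np.zeros(n_speakers)
ref_code[2] = 1.0                            # reference speaker's one-hot code

x = rng.standard_normal(dim)                 # one speech frame to adapt

def speaker_error(x):
    """Squared distance between speaker output and the reference code."""
    return 0.5 * np.sum((ref_code - f(W @ x)) ** 2)

e0 = speaker_error(x)
eta = 0.1
for _ in range(500):
    s = f(W @ x)
    grad = W.T @ ((s - ref_code) * s * (1 - s))  # backpropagated speaker error
    x = x - eta * grad                           # modify the input, not the weights
e_final = speaker_error(x)
```

In the paper's full method this update runs alongside the phone output constraint, so the frame keeps its phone identity while its speaker characteristics drift toward the reference speaker.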



Speaker adaptation using the inverted network

Another solution for such normalization is that, instead of transferring the decision region by a compensating vector, or one-by-one, step-by-step tuning of the frames, we create a network trained for such normalization; that is, a model capable of producing the normalized signal upon knowing the location of the input pattern in the respective decision region. What has been stored in the weights of the primary network (the direct network) is knowledge that, using the learned similarities in the input patterns, creates the decision regions relating to the different phones from different speakers in the input space. The inverted network now learns to relocate the desired signal using the information existing in the decision regions. Therefore, by suitable modification at the inverted network input, we can displace the decision regions formed in its input space and get the signal analogous to these displaced decision regions. For example, to convert a speaker's talk signal into the reference speaker's talk, we can supply the phone data of the desired signal to the inverted network input, together with the reference speaker data. Consequently, after displacement, the decision regions of the inverted network reside on the analogous regions of the reference speaker, the relocated signal is somehow adapted to the reference speaker, and hence our desired non-linear normalization has been achieved. So, if we can have a network that is the inverse of the primary network, it can be used for transferring the patterns to the desired decision regions. The procedures to be implemented in the input adaptation method using the inverted network are as described below:

1. Training of the direct network, so that it is capable of retrieving the phone and speaker data from the input signal. The network employed at this step is completely similar to the primary network described in the previous methods, and the reference speaker is selected in the same manner.

2. Training of the inverted network, so that, acquiring information from the output (or hidden layer) of the direct network, it produces at its output an approximation of the direct network's input, that is, the talking signal characteristics analogous to that information.

3. Non-linear normalization of the training and test signals: retrieving the phone data from the desired signal through the direct network, and giving it to the inverted network together with the reference speaker data. Therefore, what we have at the inverted network output would be an approximation of the normalized input signal.

4. The final network is trained with the help of the normalized training signal (adapted to the reference speaker) acquired in the previous step, and is tested with the normalized test signal.
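The data flow of steps 1-3 can be sketched as follows. Both networks are stand-ins: a fixed linear map plays the role of the trained inverted network, purely to illustrate how the retrieved phone code and the reference speaker code are combined into a normalized frame. Nothing here is the paper's actual trained model.

```python
import numpy as np

rng = np.random.default_rng(4)
n_phone, n_speaker, dim = 3, 2, 4

# Stand-in "inverted network": a fixed linear map from (phone, speaker) codes
# back to an input-representation frame.
B = rng.standard_normal((dim, n_phone + n_speaker))

def inverted_net(phone_code, speaker_code):
    return B @ np.concatenate([phone_code, speaker_code])

def normalize(phone_code, ref_speaker_code):
    # Step 3: feed the phone data retrieved by the direct network, together
    # with the reference speaker code, into the inverted network.
    return inverted_net(phone_code, ref_speaker_code)

phone = np.array([0.0, 1.0, 0.0])    # phone code retrieved by the direct network
ref = np.array([1.0, 0.0])           # reference speaker's code
x_norm = normalize(phone, ref)       # approximation of the normalized frame
```

Because the original speaker's code never reaches the inverted network, the same phone uttered by any speaker maps to the same reference-speaker frame, which is exactly the normalization the method aims at.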


7. Experiments

The data used in these experiments are two sentences from the FARSDAT data, uttered by 101 persons [12]. These data contain clean Farsi speech recorded in a silent room, and are roughly the equivalent of the TIMIT data for the English language. The sentences of these data include nearly all Farsi phones and contain 48 different syllables. The voice data of the first 71 persons have been used as training data, and the voice data of the next 30 persons for model testing. In the experiments, the LHCB parameters (the energy logarithm of the Hanning-shaped filter bank with critical bands), which performed with high efficiency in previous works, have been used as the extracted characteristics. Each frame includes 18 parameters representing the energy content of the Hanning windows placed over the Fourier transform of the desired signal on the Bark scale. For normalization of the LHCB parameters, fine longitudinal normalization has been used [13].


After data preparation, the network training is started. The training algorithm of this network is the error backpropagation algorithm; the network has been trained with a learning coefficient of 0.1, a momentum coefficient of 0.7, and random selection of samples. Randomization in learning prevents the phone and speaker orders peculiar to the training data from having any effect on the network training. Another important point is that, after convergence and termination of the training, should the network still have learning capacity, it will attend to the specific details contained in the training data, which is not desirable to us and will cause a decline of the recognition correctness on the test data. Hence, during the training process, after completion of every several training courses, the test data undergoes a test step, and finally the weights having the highest percentage of correctness are selected as the network weights. This network achieves 81.9% phone recognition correctness, which we have tried to improve through implementing the speaker adaptation methods.
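The weight-selection rule described above amounts to keeping the checkpoint with the highest test-set correctness. A minimal sketch of the bookkeeping, with an invented accuracy sequence purely for illustration:

```python
def select_best_weights(history):
    """history: list of (weights, test_correctness) checkpoints taken
    every few training courses; keep the highest-correctness one."""
    return max(history, key=lambda item: item[1])

# Illustrative checkpoints (weight labels and percentages are made up).
history = [("w_epoch5", 78.2), ("w_epoch10", 81.9), ("w_epoch15", 80.7)]
best_weights, best_acc = select_best_weights(history)
```

This is a form of early stopping: even though training continues, the model ultimately used is the one from the checkpoint where test correctness peaked.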


For the purpose of training the primary network, it is necessary to make some changes in the training procedure. A kind of trade-off is observed in the training process of this network: because of the type of sample selection, the network learns the phones at first, and then proceeds with learning the speakers and forgets the phones. Therefore, after achieving the highest percentage of correctness on phone recognition (on the test data), all the network weights are kept constant, and the training process continues with correction of only the W_kp2 and W_kp3 weight matrices. Using this solution, while preserving the network's knowledge about the phones, its knowledge about the speakers increases. A hidden layer has been added in the course of speaker recognition so that these two matrices have enough capacity to learn the speaker information. Observing the output of the neurons relating to the speaker code, it can be seen that, in the case of regulating the weights W_kp2 and W_kp3 (Fig. 2) with a learning coefficient of 0.1, all the outputs are about zero, and the network is unable to make the step between 0 and 1. So, to reduce the size of the learning step, we used a learning coefficient of 0.01. This way, a speaker recognition correctness of 75% on the training data and a phone recognition correctness of 81.4% on the test data is achieved at the frame level. Reviewing the percentage of phone recognition correctness on the training speakers, speaker No. 61 shows the highest phone recognition percentage (89.6%). This speaker is selected as the reference speaker.



Implementing the decision region centers calculation method

During the implementation of this method, the center of the decision region for each phone of each speaker was calculated using error backpropagation at the output. Then the distance of each center from the center of the analogous phone of the reference speaker was calculated and stored as the compensating vector of this phone from this speaker. During the final network training, first each section of the signal was given to the primary network and its phone code and analogous speaker code were calculated; then, with the help of the compensating vector relating to the achieved phone and speaker codes, this part of the input signal was transferred to the analogous decision regions of the reference speaker. The final network, with a structure similar to the reference network, was trained using this transferred data and was tested using the test data, which had been adapted to the reference speaker using precisely the same method.

One issue we encountered in practice is that, in all of the proposed methods, although their implementation corrects many of the reference network errors, in some instances the final network fails even where the primary network recognized correctly. A major reason is that both algorithms used for tuning the input rely in some way on the phone recognition of the primary network, so recognition errors of the primary network hinder the correction of the input. This problem is especially serious when adapting the test data, whose phone labels are not available. Implementing these methods therefore requires a criterion for determining which instances have actually been corrected by the new network (trained with the adapted data). To obtain this criterion, we compared the outputs of the primary and final networks. After investigating these outputs, it was observed that when the phone recognition result of the primary network differs from that of the final network for a specific part of the signal, the final network's decision is acceptable only when its confidence level in the uttered phone is higher than that of the primary network.
As mentioned before, to determine the code of the phone recognized by the network, among the values obtained at the phone output neurons, which lie in the interval (0, 1), we take the highest value as 1 and the others as zero. By the confidence level of the phone utterance we mean this maximum value, and we use it as the criterion for determining which instances to correct. By implementing this method, which we call the combination of the results of the primary and adapted networks, many of the above-mentioned instances are corrected.
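The combination rule described above reduces to a few lines: take the arg-max phone of each network, and on disagreement keep the final network's answer only when its maximum activation (the confidence level) is higher. A minimal sketch, with illustrative names:

```python
import numpy as np

def combine_by_confidence(primary_out, final_out):
    """Combine the phone decisions of the primary and final networks.
    Each argument is a vector of phone-output activations in (0, 1).
    The confidence level is the maximum activation; on disagreement the
    final network's phone is accepted only if its confidence is higher."""
    p_phone, p_conf = int(np.argmax(primary_out)), float(np.max(primary_out))
    f_phone, f_conf = int(np.argmax(final_out)), float(np.max(final_out))
    if p_phone == f_phone:
        return p_phone                      # networks agree
    return f_phone if f_conf > p_conf else p_phone
```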

Hence we obtain the results indicated in Table 1, a 2.1% improvement compared with the reference network.

Table 1: Results gained from the implementation of the first method of input adaptation.

                                                  FINAL NETWORK    COMBINATION OF THE PRIMARY AND FINAL NETWORKS
Percentage of the phone recognition correctness       84.1              84.3


Implementing the adaptation of all input frames



As described before, when using this method, after training the primary network and selecting the reference speaker, the training and test data must be adapted to the reference speaker using the error backpropagation algorithm. The desired output here is the phone code recognized by the network together with the reference speaker code (the maximum value among the output values is taken as 1 and the rest as zero), so the desired phone output can change. As in the training step, all the frames are selected randomly, but the error function is defined only on the neurons related to the speaker. In this way we tell the network to transfer the input pattern into the decision region, for the same phone it has recognized, belonging to the reference speaker. The error decrease diagram and the phone recognition accuracy during regulation of the input are shown in Fig. 7. As seen from the figure, although the phone recognition accuracy fluctuates considerably during the input correction because of the random selection of the samples, the range of these fluctuations shows no considerable change. This behavior of the modification network was predictable: we granted the network the ability to change the content of the phones, but these changes take effect only when the error backpropagated through the speaker route permits them.
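The soft adaptation step, gradient descent on the input frame with the error defined only on the speaker neurons and the weights frozen, can be sketched as below. A single weight matrix stands in for the trained speaker branch of the network, which is an assumption for illustration; the paper backpropagates through the full multilayer net.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adapt_input_frame(x, W, ref_idx, lr=0.1, steps=1000):
    """Soft adaptation of one input frame. The frozen weights W define the
    speaker-output error; only the frame x is updated, so that the
    reference speaker's neuron approaches 1 while the others approach 0."""
    x = np.array(x, dtype=float)
    target = np.zeros(W.shape[0])
    target[ref_idx] = 1.0
    for _ in range(steps):
        y = sigmoid(W @ x)
        delta = (y - target) * y * (1.0 - y)  # dE/dz for squared error
        x -= lr * (W.T @ delta)               # gradient step on the input only
    return x
```

The phone-output neurons carry no error here, which is what lets the phone content survive while the speaker identity is pushed toward the reference.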




Fig. 7. A) Error decrease curve for the speaker outputs versus the number of iterations during tuning of the input. B) Phone recognition accuracy on the training data versus the number of iterations during tuning of the input.



After training the final network with the training data obtained in this step and testing it on the adapted test data, the results indicated in Table 2 are obtained, showing a 2.6% improvement compared with the reference network.


Table 2: The results acquired from implementing the second method of input adaptation.

                                                  FINAL NETWORK    COMBINATION OF THE PRIMARY AND FINAL NETWORKS
Percentage of the phone recognition correctness       83.1              84.5







Implementing the inverted network method



The direct network in this method is quite similar to the primary network. A solution that at first seems suitable for training the inverted network is to give it, as input, the values of the 35 phone output neurons and the 72 speaker output neurons produced by the direct network, and to take the input signal characteristics as the desired output of the inverted network. Experiments show, however, that this procedure makes the inverted network diverge, because the mapping it would have to learn is one-to-many. So instead we take as the inverted network input the values of the last hidden layer of the direct network, which carries more information than the output layer for regeneration of the signal by the inverted network. Also, to reduce the error as far as possible, we approximate only one frame of the input signal at the output of the inverted network and obtain all the frames by sliding along the signal. A linear layer is added as well, to produce values outside the range of the sigmoid function output. The resulting structure of the inverted network is shown in Fig. 8.



Fig. 8. Structure of the direct and inverted networks. (Figure labels: input signal, direct network, inverted network, approximation of the input signal by the inverted network, speaker utterance.)


In Fig. 9 the error reduction curve of this network during training is shown. As is clear from this figure, the error does not fall below about 0.3. The reason is the one-to-many nature of the mapping this network has to learn: for a given input, the network can ultimately produce only a non-linear average of the possible signals.





Fig. 9. The error reduction curve of the inverted network training.


[Residue of the Fig. 8 diagram: Farsi layer labels and dimensions for the direct and inverted networks, including the linear layer, the input signal x, and its estimate x̂.]


As mentioned in the description of the proposed method's algorithm, normalization requires that the normalized values be generated by the inverted network using the phone information of the given signal together with the information of the reference speaker. However, because the inverted network has been trained on the values of the hidden layer, another inverted network is needed to map the phone and speaker information to the hidden layer values. The structure of this network, which we hereafter call the medium inverted network, is shown in Fig. 10.


Fig. 10. The structure of the direct and medium inverted networks. (Figure labels: approximation of the hidden layer values from the direct network.)


After training these two networks, we were able to obtain an approximation of the input signal at the inverted network output. To assess this approximation, we feed the approximated signal to the direct network input and achieve a phone recognition accuracy of 81.4% at its output. Also, a new network with a structure similar to the reference network is trained and tested using this approximated signal, achieving a phone recognition accuracy of 81.8%. Comparing these results with the phone recognition accuracy of the direct network (81.4%) and the reference network (81.9%) indicates that the approximated signal is sufficiently accurate, both in its similarity to the original signal and in its phone content.



In the next step, the input signal is given to the direct network, and the phone and speaker information is obtained from its output. To carry out the adaptation, we discard the obtained speaker information and give the phone information, together with the reference speaker code, to the input of the medium inverted network. The medium inverted network approximates the hidden layer values so that they carry the phone content of the signal together with the reference speaker information. Using these approximate values, the inverted network then regenerates the signal adapted to the reference speaker (Fig. 11).
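The adaptation pipeline just described can be summarized in a few lines. The three callables stand for the trained networks and are assumptions for illustration; the point is the data flow: the recovered speaker code is dropped and replaced by the reference speaker's code.

```python
def adapt_signal(frame, direct_net, medium_inverted_net, inverted_net, ref_code):
    """One pass of the inverted-network adaptation (illustrative sketch)."""
    phone_code, _speaker_code = direct_net(frame)            # keep phone, drop speaker
    hidden_est = medium_inverted_net(phone_code, ref_code)   # estimate hidden layer
    return inverted_net(hidden_est)                          # regenerate adapted frame
```

Because every stage is a forward pass through a trained network, this method avoids the iterative per-frame gradient descent of the backpropagation-based adaptation.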



Fig. 11. Data adaptation in the inverted network method.

[Residue of the Fig. 10 and Fig. 11 diagrams: Farsi labels for the direct, inverted, and medium inverted networks, including the input signal x, the hidden layer values h and their estimate ĥ, the phone information, the reference speaker code, and the signal adapted to the reference speaker.]


The final network, having a structure similar to the reference network, is trained using the adapted training data and tested using the adapted test data. Hence the results indicated in Table 3 are obtained, showing an improvement of about 3% compared with the reference network.


Table 3: The results from implementing the third input adaptation method.

                                                  FINAL NETWORK    COMBINATION OF PRIMARY AND FINAL NETWORKS
Percentage of the phone recognition correctness       83.8              84.9


8. Discussion and conclusion

In the present study we tried to reduce the effects of speaker variability using non-linear normalization of the signal. In the employed methods, all the signals in the test and training database are adapted to a reference speaker, so that the speaker-independent system is approximately converted into a speaker-dependent one. The success of this method depends heavily on the network's ability to learn the speaker codes. It seems that adding more hidden layers, in a way that remains time-efficient, and clustering the speakers could improve speaker recognition. For clustering the speakers the data must be sufficiently rich; in that case it would be possible to adapt the clusters to a reference cluster instead of adapting individual speakers.

For normalization, two general ideas were investigated. The first was using the error backpropagation algorithm for adaptation of the input, implemented through two methods: coarse and soft compensation. In coarse adaptation of the data, the compensating vectors, i.e. the distances of the decision region centers of the different speakers from the analogous decision region centers of the reference speaker, were used. This method resulted in a 2.2% improvement of the phone recognition. In soft adaptation of the data, all the input frames were adjusted using the error backpropagation algorithm so that they produce the reference speaker code at the output. Implementation of this method resulted in a 2.6% improvement in phone recognition.

The second idea was using the inverted network for non-linear normalization of the input signal. In this method the direct network retrieves the phone and speaker information from the input signal, and the phone information is supplied to the inverted network together with the reference speaker information. Using this information, the inverted network produces an approximation of the input signal adapted to the reference speaker. This normalization yields a 3% improvement in phone recognition. Compared with input modification by error backpropagation, this method produces better results while maintaining greater computational stability. Surveying the results of the above methods shows that the vowels and semi-vowels are corrected better than all other phonic classes.



The reason for this better correction can be attributed to the fact that, according to the phonetics literature, vowels and semi-vowels carry more speaker information than the other phonetic classes [8], so speaker variability compensation has more influence on them.

Although using the confidence level criterion for combining the results corrects many problematic instances during implementation of the methods, there are still instances that it does not correct. By investigating the results of the final and primary networks on the different phonic classes, it may be possible to choose a criterion applied in proportion to the phone type, producing better results.




Acknowledgments

The present study was carried out with the support and cooperation of the Intelligent Signal Processing Institute.



References

[1] E. Koerner, H. Tsujino, T. Masutani, "A Cortical Type Modular Neural Network for Hypothetical Reasoning," Neural Networks, vol. 10, no. 5, pp. 791-814, 1997.

[2] E. Koerner, M. O. Gewaltig, U. Koerner, A. Richter, T. Rodemann, "A Model of Computation in Neocortical Architecture," Neural Networks, vol. 12, pp. 989-1005, 1999.

[3] F. Beaufays, H. Bourlard, H. Franco, N. Morgan, "Neural Networks in Automatic Speech Recognition," IDIAP Research Report 01-09, March 2001.

[4] C. J. Leggetter, P. C. Woodland, "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models," Computer Speech and Language, vol. 9, no. 2, pp. 171-185, 1995.

[5] J. Gauvain, "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 291-298, 1994.

[6] Y. Tsao, S. M. Lee, F. C. Chou, L. S. Lee, "Segmental Eigenvoice for Rapid Speaker Adaptation," Proc. Eurospeech, Aalborg, Denmark, 2001.

[7] E. J. Pusateri, T. J. Hazen, "Rapid Speaker Adaptation Using Speaker Clustering," Proc. ICSLP, Denver, Colorado, pp. 61-64, 2002.

[8] C. Huang, T. Chen, S. Li, E. Chang, J. Zhou, "Analysis of Speaker Variability," Proc. Eurospeech, vol. 2, pp. 1377-1380, 2001.

[9] K. Karimi, "Implementing the speaker characteristics for quality improvement of the speech recognition models," M.A. thesis, Amir Kabir Industrial University, Medical Engineering Faculty, Summer 1381.

[10] S. A. Salehi, "Farsi continual speech recognition using a human brain performance model for speech realization," Ph.D. thesis, Tarbiat Modarres University, Technical Engineering Faculty, Bahman 1374.

[11] S. A. Salehi, "Linear and non-linear normalization of the representation parameters in speech recognition," in preparation.

[12] M. Bijankhan et al., "FARSDAT: The Speech Database of Farsi Spoken Language," Proc. SST-94, Perth, pp. 826-831, 1994.

[13] M. Rahiminejad, S. A. Salehi, "Comparison and efficiency estimation of the representing parameters exploitation methods in speaker independent speech recognition," Amir Kabir Scientific Research Journal, 4th year, no. 55-A, Summer 1382.


-------------------------------
TRANSLATED FROM FARSI TO ENGLISH BY:

ARAN TECHNOLOGY DATA COMMUNICATIONS CO. LTD.

TELFAX: (+9821) 33917208