Voice Recognition by a Realistic Model of Biological Neural Networks

Oct 20, 2013



By Efrat Barak*

Supervised by Karina Odinaev and Igal Raichelgauz

* visl.technion.ac.il/~efrat_b


Abstract


In this project, a new model for a voice recognition system is suggested. The model is based on a realistic model of neural networks, and it integrates principles from the theories of chaotic systems and Liquid State Machines. The model was implemented in MATLAB, and several tests were performed on it. The task of the system in those tests was to recognize the voice of a specific person (a voice that it was trained on) out of hundreds of other voices.



The Problem


The objective of this project was to design a system that can classify voices, i.e., recognize the voice of a specific person. However, the problem of voice classification is too wide to be solved by finite state machines, since it is obviously impossible to create a state for every word that every person in the world might say. Even if it were possible, it would be impractical to save recordings of every word in the voice of every person in order to perform bit-by-bit comparison.



The Solution


The task of voice recognition is highly suitable for neural networks, since such networks can work as classifiers and distinguish the voice that they learn to identify from other voices. Such networks can learn the characteristics of the voice, and therefore using them does not require endless recordings of words.

A new model for a voice recognition system which is based on neural networks is suggested in this project. Our approach to voice recognition integrates concepts from the theories of chaotic neural networks and Liquid State Machines. The main principle of the proposed model is that the input signal is recognized based on the current state of the model, which is a limit cycle in which the output of the network is periodic and uniquely defines that state. We defined this state as a basin. The model is presented in figure 1.


Figure 1: The model of the proposed system for voice recognition.


The input signal, which represents the auditory stimulus, consists of several parallel spike trains which are transmitted simultaneously to the input neurons of the neural network (see section 4.1 for more information on the creation of the stimulus). The input neurons transmit an internal signal to the neural network, which consists of a few hundred neurons in a three-dimensional structure. The network is pushed (by the input signal) to a basin. The readout function receives the spike trains of all the neurons in the network and recognizes the basin that the network has converged to by comparing it with the output patterns of basins that appear in the indicators map. It then classifies the input according to the indicator that belongs to that basin. The output signal determines the class of voices to which the input signal belongs.


The neural network consists of 135 spiking neurons in a 3x3x15 formation. The behavior of the neurons is simulated by the Leaky Integrate and Fire (LIF) model, and the neurons are connected by dynamic spiking synapses. Twenty percent of the neurons in the network are randomly chosen to be inhibitory, and the rest of the neurons are excitatory, in correspondence with the biological values. The connectivity of the network is moderate (λ = 2).
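The leaky integration and threshold-reset behavior of a single LIF neuron can be sketched as follows. This is a minimal Python illustration (the project itself was implemented in MATLAB with the CSIM simulator), and all parameter values here are assumptions for demonstration, not the values used in the project:

```python
def simulate_lif(input_current, dt=1e-3, tau_m=0.03,
                 v_rest=0.0, v_thresh=0.015, v_reset=0.0, r_m=1.0):
    """Simulate one leaky integrate-and-fire neuron.

    The membrane potential leaks toward v_rest and integrates the input
    current; when it crosses v_thresh the neuron emits a spike and resets.
    Parameter values are illustrative only.
    """
    v = v_rest
    spike_times = []
    for step, i_in in enumerate(input_current):
        # Discrete Euler step of the LIF membrane equation.
        v += (-(v - v_rest) + r_m * i_in) * dt / tau_m
        if v >= v_thresh:
            spike_times.append(step * dt)
            v = v_reset
    return spike_times

# A constant supra-threshold current yields a regular spike train.
spike_times = simulate_lif([0.02] * 1000)
```

With a constant input, the time to the first spike follows the familiar charging curve of the leaky integrator, which is why a stronger input produces a higher firing rate.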


An important advantage of the proposed model is that several tasks can be performed with only one network at the same time: the readout function can be trained to recognize several people by finding the current basin of the network and comparing it to several indicators maps (one for each person).


Tools


MATLAB 6.5 was used for the development of the voice recognition system and a GUI that enables full control of it. The tests were performed with two databases of recorded speech: the first one was recorded in the SIPL laboratory of the Electrical Engineering faculty of the Technion (IIT), and the second one was taken from the NIST database that is offered at http://www.nist.gov/speech/tests/lang/2003/.

The neural network that was studied in this project was created in a new simulator for neural microcircuits, CSIM, in the MATLAB environment. Full details of the CSIM simulator can be found at http://www.lsm.tugraz.at/csim/.


Two methods were used for encoding the recorded speech signal into spike trains:

- Amplitude Encoding: In this method, a straightforward conversion is performed between the amplitude at time t and the number of neurons that would fire at that time.

- MFCC Encoding: In this method the auditory signal is represented by Mel Frequency Cepstral Coefficients (MFCCs), which are coefficients that are based on human perception. The auditory signal is divided into small segments, and each of them is transformed (by FFT) to the frequency domain. The frequency bands are positioned logarithmically on the mel scale, which is a scale of pitches that were judged by listeners to be equally distant from each other.
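The two encodings can be sketched as follows. This is a Python illustration (the project used MATLAB); the input-neuron count for the amplitude method is an assumed value, and the Hz-to-mel mapping is the standard formula commonly used when placing MFCC filter bands:

```python
import math

def amplitude_encode(samples, n_neurons=20):
    """Amplitude encoding: map each sample's (absolute) amplitude to the
    number of input neurons firing at that time step. n_neurons is an
    illustrative choice, not the project's value."""
    peak = max(abs(s) for s in samples) or 1.0
    return [round(abs(s) / peak * n_neurons) for s in samples]

def hz_to_mel(f):
    """Standard Hz-to-mel conversion, giving the logarithmic band
    placement described above."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

counts = amplitude_encode([0.0, 0.5, -1.0, 0.25])
# counts == [0, 10, 20, 5]
```

Under the mel mapping, equal steps in mel correspond to progressively wider steps in Hz, matching the perceptual spacing described above.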


The Classification Process

The classification process consists of three main parts:

1. Training: In this part the system is trained on different auditory stimuli: some of them contain the voice that the system should learn to identify, and others contain voices of other people. Simulations of the neural network are performed for every voice segment. The system learns the basins that the network converges to in every simulation and creates an indicators map: the indicators are numbers that are related to each basin, and they indicate how well this basin represents the wanted voice.

2. Tuning: In this part the simulations are performed on another database (which includes voice segments of the wanted person and of other persons), and the user tunes the classification parameters so that the indicators map would best suit the person that the system should identify.

3. Testing: In this part a new stimulus is presented to the neural network. The system finds the basin that the network converged to and makes a classification decision based on the indicator of that basin. The output is an answer whether the stimulus is the voice of the wanted person or of another person.
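The indicator-map idea behind these three stages can be sketched as follows. This is a simplified Python stand-in (the project used MATLAB): here the indicator of a basin is just the fraction of training segments of the wanted voice that led to it, and the threshold stands in for the user-tuned classification parameters:

```python
def train_indicator_map(basins, labels):
    """Training: for each basin the network converged to, compute an
    indicator of how well that basin represents the wanted voice.
    basins: basin identifier per training segment.
    labels: True if the segment was of the wanted voice.
    (Simplified stand-in for the project's indicator scheme.)"""
    counts = {}
    for basin, wanted in zip(basins, labels):
        total, hits = counts.get(basin, (0, 0))
        counts[basin] = (total + 1, hits + (1 if wanted else 0))
    return {b: hits / total for b, (total, hits) in counts.items()}

def classify(basin, indicator_map, threshold=0.5):
    """Testing: decide 'wanted' if the indicator of the basin the
    network converged to exceeds the tuned threshold."""
    return indicator_map.get(basin, 0.0) >= threshold

imap = train_indicator_map(['A', 'A', 'B', 'B', 'B'],
                           [True, True, True, False, False])
# imap == {'A': 1.0, 'B': 1/3}
```

In this sketch, tuning amounts to choosing the threshold that best separates the indicators of wanted and unwanted segments on a held-out data set.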


The different stages of the voice recognition process are depicted in figure 2.


Figure 2: A flow chart of the classification process.


Results

Several terms were defined for evaluation of the classification results:

- Hit Segments: Voice segments that were classified correctly.

- Miss-Hit Segments: Voice segments of the person that the system was trained to identify, which were classified as voices of other people.

- False Alarm Segments: Voice segments of different people (not the one that the system was trained to identify), which were classified as the voice of the wanted person.
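These evaluation terms can be computed from per-segment labels, as in the following Python sketch (the project used MATLAB; the boolean-label representation is an assumption for illustration):

```python
def classification_rates(true_wanted, predicted_wanted):
    """Compute hit, miss-hit and false alarm rates as defined above.
    true_wanted: True if the segment really is the wanted voice.
    predicted_wanted: True if the system classified it as wanted."""
    pairs = list(zip(true_wanted, predicted_wanted))
    wanted = [p for t, p in pairs if t]       # segments of the wanted voice
    other = [p for t, p in pairs if not t]    # segments of other voices
    hit = sum(wanted) / len(wanted)           # fraction classified correctly
    return hit, 1.0 - hit, sum(other) / len(other)

# 3 of 4 wanted segments hit; 1 of 2 other segments falsely accepted.
rates = classification_rates([True, True, True, True, False, False],
                             [True, True, True, False, True, False])
# rates == (0.75, 0.25, 0.5)
```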


Table 3 presents the database that was used for the tests.




                          Num. of Voice Segments
                          Wanted Voice    Other Voices
Data Set 1 (Training)          30              300
Data Set 2 (Tuning)            30               30
Data Set 3 (Test I)           100              400
Data Set 4 (Test II)           38               40

Table 3: The database that was used for the tests



Results for Amplitude Encoded Input

The stimuli that were created by amplitude encoding were examined, and no significant difference was found between the stimuli of the wanted person and those of other persons. The task of the system was therefore to identify the voice of the wanted person, not to identify a certain type of stimulus.

Table 4 presents the results of two classification tests that were performed on two different (parallel) systems. In these tests, data set 1 was used for training the system, data set 2 was used for tuning it, and data set 3 was used for testing it.


Segment Num.    Classified as            Test Num. 1    Test Num. 2    True Classification
1-100           Wanted (Hit)             71%            94%            100%
                Unwanted (Miss-Hit)      29%            6%             0%
101-492         Wanted (False Alarm)     55.9%          61.23%         0%
                Unwanted                 41.1%          38.77%         100%

Table 4: The results of classification tests number 1 and 2



The results that are presented in table 4 show that both of the systems identified most of the segments of the wanted voice. Clearly, the indicators map of the network that was used in test 2 was much better than that of the first network: 94% of the wanted voice segments were identified in the second test, while only 71% of them were identified in the first test. This shows that the internal structure of the neural network (which is generated at random) significantly influences its classification ability.

The difference (in percent) between the segments that were classified correctly and the false alarm segments was 15% in the first test and 34% in the second test. These large differences show that the classification of any voice segment as a wanted segment is not a random process. The reason for the high false alarm rate is that the system was designed to find most of the voice segments of the wanted voice, even at the cost of many other segments being classified as wanted.


Results for MFCC Encoded Input

An examination of the stimuli that were generated in the MFCC method revealed that the stimuli of the wanted voice were quite similar to each other but very different from the stimuli of the other voices. In this manner, the classification task became a task of distinguishing between two types of stimuli.

Table 5 presents the results of two classification tests that were performed on the same system. Data set 1 was used for training the system, data set 2 was used for tuning it, and data sets 3 and 4 were used for testing it.




True Classification    Classified as            Test I (100 wanted,    Test II (30 wanted,
                                                400 unwanted)          30 unwanted)
Wanted                 Wanted (Hit)             87%                    86.8%
Wanted                 Unwanted (Miss-Hit)      13%                    13.2%
Unwanted               Wanted (False Alarm)     55.3%                  45%
Unwanted               Unwanted                 44.7%                  55%

Table 5: The results of the classification tests


The results detailed in table 5 indicate that the system is quite reliable in classifying new data: most of the segments of the wanted voice and about half of the segments of the unwanted voice were classified correctly. A proof of the consistency of the system arises from the fact that the hit rate was almost identical in the two classification tests, even though they consisted of significantly different numbers of voice segments and there was no overlap between them.



Conclusion


A new method for performing voice recognition by a realistic model of biological neural networks was presented and implemented in this project. Several systems were configured and trained by the presented method. They were tested on two types of stimuli: one that was created by amplitude encoding of recorded speech, and another that was created by the MFCC method. The amplitude encoding method was found to be efficient, while the MFCC method yielded stimuli which were very typical for each person.

Tests that were performed on stimuli of both kinds showed that the systems were efficient in identifying the voice that they were trained to find: the systems found and correctly classified most of the voice segments of the 'wanted person', even when they were given a stimulus that contained many more voice segments of other people. The conclusion is that such systems can handle a very high level of noise. Other tests have shown that the systems were consistent and stable in their classification performance.

Altogether, the tests that were carried out in this project showed that our model for a neural-network-based voice recognition system is well suited for performing voice classification tasks.


Acknowledgment


I am grateful to my project supervisors Karina Odinaev and Igal Raichelgauz for their help and guidance throughout this work. I would also like to thank the Ollendorf Research Center Fund for supporting this project.