ECE 539 Final Project
Speech Recognition
Christian Schulze
902 227 6316
freeslimer@gmx.de
1.Introduction
1.1.Problem Statement
Design of a speech recognition Artificial Neural Network that is able to distinguish between the digits 0 to 9 and the words “yes” and “no”.
1.2.Motivation
1.2.1.Applications
This system can be very useful for cellular phones, so that the user only has to say the digits to dial the desired telephone number.
In addition, it can be used in elevators to reach the desired floor without having to push any buttons.
It can be used almost anywhere that choices require number selections.
1.2.2.Comparison to huge systems
Conventional speech recognition systems contain huge libraries of stored patterns for all the words to be recognized. This requires, of course, a large memory to store the data. In addition, these systems need a fast processor for the non-linear time adjustment of the target patterns to the respective words to be recognized and for their comparison to the patterns.
After the comparison an error is calculated which represents the distance between a pattern and a word. The pattern with the smallest error is the recognition result.
This is too expensive for such a small application.
That is why I tried to train a neural network that can do the same.
1.3.Pre-observation of the Artificial Neural Network
I designed a Multilayer Perceptron (MLP) Network using the Back Propagation Algorithm of Professor Hu to solve this classification problem. Corresponding to the number of words I want to distinguish, this is a problem with 12 different classes, which fixes the number of network outputs at 12.
The input data are extracted feature values that represent the signal. For every part of a signal I store the first two formants, which correspond to the first two maxima of that part’s spectrum. That is why the process is also called “Two-Formants-Recognition-System”.
I implemented the whole algorithm using Matlab.
2.Work Performed
2.1.Listing of Steps
2.1.1.Raw data collection
Recording of 50 samples for every word using “Cool Edit”
o Sampling frequency: 44100 Hz
o Channels: mono
o Resolution: 32 bit (quantization of the amplitude values)
Storage of each sample in its own text file (stored amplitude values for all time values)
Summary of all text file names in a vector “text_files” saved in “textf.mat” using the function “textfiles.m”
2.1.2.Extraction of input data
Passing of the text file vector to the function “spectrum_complete.m”
o Adjustment of all signals to a uniform signal length of 500 ms
o If the real signal length is smaller than 500 ms, the missing values are filled with zeros
No time stretching or shrinking
No change of the feature data
Windowing of the signal every 10 ms into parts of 20 ms length using the Hann window (multiplication of the signal with the window)
=> 49 separate sections for each sample
Calculation of the spectrum of these sections using the Discrete Fourier Transformation
(the spectrum of the sampled signal is unique only up to 22050 Hz, half the sampling frequency, but only the first 4 kHz are of interest)
Implementation of the Cepstral Algorithm for the calculation of the spectrum’s contour (the spectrum becomes smoothed)
o Calculation of the Cepstral Coefficients by applying the Inverse Discrete Fourier Transformation after using the natural logarithm on the spectral coefficients => change to the “quefrency” domain (a kind of time)
o Front values contain the information of the high frequencies, back values contain the information of the low frequencies
o Short pass liftering of the Cepstral Coefficients (taking only the first 147 Cepstral Coefficients)
o Calculation of the smoothed spectrum by applying the Discrete Fourier Transformation to the Cepstral Coefficients
Storage of the frequencies of the first two formants in a vector
=> in this example these two values are 441 and 804
o The vector for every sample becomes 98 by 1
o Connection of this vector with the desired values corresponding to the respective class and storage in a matrix
o This matrix consists of 50 of these vectors for every word, so that its dimension is 600 by 110.
o Storage of these “input_data” in the file “input.mat”
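The framing arithmetic of the steps above can be checked with a short sketch. It is written in Python for illustration only and is not the author's “spectrum_complete.m”; the function names are hypothetical, and cutting signals longer than 500 ms is an assumption, since the report only describes zero filling of shorter signals.

```python
import math

SAMPLE_RATE = 44100                         # Hz, from the recording settings
SIGNAL_MS, FRAME_MS, HOP_MS = 500, 20, 10   # lengths given in the report

def pad_to_length(samples, n_total):
    """Zero-fill a short signal to n_total values (cutting longer ones is an assumption)."""
    return list(samples[:n_total]) + [0.0] * max(0, n_total - len(samples))

def hann(n_len):
    """Symmetric Hann window of length n_len."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n_len - 1)) for k in range(n_len)]

def frames(samples):
    """Split a signal into Hann-weighted 20 ms sections taken every 10 ms."""
    n_total = SAMPLE_RATE * SIGNAL_MS // 1000    # 22050 samples
    frame_len = SAMPLE_RATE * FRAME_MS // 1000   # 882 samples
    hop = SAMPLE_RATE * HOP_MS // 1000           # 441 samples
    padded = pad_to_length(samples, n_total)
    window = hann(frame_len)
    sections = []
    for start in range(0, n_total - frame_len + 1, hop):
        part = padded[start:start + frame_len]
        sections.append([s * w for s, w in zip(part, window)])
    return sections

if __name__ == "__main__":
    sections = frames([0.1] * 1000)       # a short dummy signal
    n_features = 2 * len(sections)        # two formants per section
    print(len(sections), n_features, n_features + 12, 12 * 50)
```

Under these assumptions the sketch yields 49 sections, 98 feature values per sample, 110 columns once the 12 desired outputs are appended, and 600 rows for 12 words with 50 samples each, which matches the dimensions stated above.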
2.1.3.Input data for the MLP-Network
Scaling of the input data to [-1,1] using “scale_data.m”
Random division of the “input_scaled.mat” data into the files “training.mat” and “testing.mat” in a ratio of 9 to 1 using the file “rand_files.m”
(integrated in the Back Propagation Algorithm files “bp_all.m” and “bp_ex.m”)
Feeding of these input data to the MLP-Network
Sizing and optimisation of the network
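A minimal sketch of the scaling and the random 9 to 1 division, in Python for illustration. It is not the content of “scale_data.m” or “rand_files.m”: scaling each feature column separately, the fixed random seed and the function names are all assumptions.

```python
import random

def scale_columns(rows, lo=-1.0, hi=1.0):
    """Linearly map every column of rows to the range [lo, hi] (per-column scaling is assumed)."""
    scaled_cols = []
    for col in zip(*rows):
        cmin, cmax = min(col), max(col)
        span = cmax - cmin
        if span == 0:
            # A constant column carries no information; place it mid-range.
            scaled_cols.append([0.5 * (lo + hi)] * len(col))
        else:
            scaled_cols.append([lo + (hi - lo) * (v - cmin) / span for v in col])
    return [list(r) for r in zip(*scaled_cols)]

def split_9_to_1(rows, seed=0):
    """Shuffle the rows and return (training, testing) in a ratio of 9 to 1."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) // 10
    return shuffled[n_test:], shuffled[:n_test]

if __name__ == "__main__":
    print(scale_columns([[0, 10], [5, 20], [10, 30]]))
    training, testing = split_9_to_1([[i] for i in range(600)])
    print(len(training), len(testing))
```

With 600 samples this gives 540 training and 60 testing vectors.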
2.2. Used algorithms and formulas
2.2.1. Discrete Fourier Transformation (DFT)
Reference: “Signal Analysis and Recognition”, Rüdiger Hoffmann
Actually I used the Short Time Discrete Fourier Transformation (DFTF), which corresponds to the convolution of the spectra of the signal and the window function.
$$ X(n, m) = \frac{1}{N} \sum_{k=m-N+1}^{m} x(k) \, h(m-k) \, \exp\left(-j \frac{2\pi}{N} n (k - m + N - 1)\right), \qquad n = 0, \ldots, N-1 $$
where h(k) represents the k-th value of the window function.
The same result is obtained by multiplying the signal with the window function and applying the normal DFT to the product.
$$ X(n) = \frac{1}{N} \sum_{k=0}^{N-1} x(k) \exp\left(-j \frac{2\pi n k}{N}\right) $$
DFT (FFT)
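The "window, then normal DFT" variant can be sketched as follows. This is an illustrative Python version with a direct O(N²) sum rather than an FFT, using the 1/N factor of the formula above; the function names and the symmetric Hann definition are assumptions, not the author's Matlab code.

```python
import cmath
import math

SAMPLE_RATE = 44100  # Hz, from the recording settings

def hann(n_len):
    """Symmetric Hann window of length n_len."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n_len - 1)) for k in range(n_len)]

def dft(x):
    """X(n) = 1/N * sum_k x(k) * exp(-j*2*pi*n*k/N), evaluated directly."""
    n_len = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / n_len)
                for k in range(n_len)) / n_len
            for n in range(n_len)]

def windowed_dft(frame):
    """Multiply the section with the Hann window, then apply the normal DFT."""
    window = hann(len(frame))
    return dft([s * w for s, w in zip(frame, window)])

def bin_to_hz(n, n_len, sample_rate=SAMPLE_RATE):
    """Frequency in Hz that belongs to spectral coefficient n."""
    return n * sample_rate / n_len

if __name__ == "__main__":
    n_len = 32
    tone = [math.cos(2 * math.pi * 4 * k / n_len) for k in range(n_len)]
    mags = [abs(v) for v in windowed_dft(tone)[:n_len // 2]]
    print(mags.index(max(mags)))   # index of the strongest coefficient
    print(bin_to_hz(80, 882))      # a 20 ms section has 882 samples
```

For a 20 ms section of 882 samples the coefficients are 50 Hz apart, so the 4 kHz range of interest ends at coefficient 80.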
2.2.2. Cepstral Algorithm
Reference: “Signal Analysis and Recognition”, Rüdiger Hoffmann
Human speech consists of two different parts. First, the vocal cords produce a source signal with a characteristic fundamental frequency. This frequency is about 150 Hz for men and about 300 Hz for women.
After the production of the source signal, the air is modulated by the vocal tract, which consists of the pharyngeal, oral and nasal cavities. This modulation causes the high frequencies in the voice. The maxima of these frequencies represent the different tones.
The Cepstral Algorithm is used to separate these two signal parts.
The modulation of the signal corresponds to a convolution of the signal with the system function of the vocal tract. This is equivalent to a multiplication of both spectra.
$$ F\{x \,\#\, g\} = F\{x\} \cdot F\{g\} = X(\omega) \cdot G(\omega) = Y(\omega) $$
# corresponds to the convolution operator
F represents the Fourier Transformation
Applying the Inverse Fourier Transformation to the logarithm of that expression separates the two parts of the speech, because they concentrate on different ranges in the time domain.
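$$
\begin{aligned}
F^{-1}\{\log Y(\omega)\} &= F^{-1}\{\log[X(\omega) \cdot G(\omega)]\} = F^{-1}\{\log X(\omega) + \log G(\omega)\} \\
&= F^{-1}\{\log X(\omega)\} + F^{-1}\{\log G(\omega)\}
\end{aligned}
$$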
Because a real signal is produced, the “Real Cepstrum” can be used, which is defined as follows:
$$ c(\tau) = F^{-1}\{\ln |X(\omega)|^2\} $$
if x is an energy signal
The calculation of the Cepstral Coefficients causes a change into the “quefrency” domain, a kind of logarithmic time domain. The information about the articulation is encoded in the lower (front) coefficients. That is why the coefficients are passed through a short pass “lifter” with the border coefficient “c_border”. The coefficient that corresponds to a given frequency can be calculated in the following way:
c ≈ sample frequency / frequency
border frequency = 300 Hz
=> c_border ≈ 44100 Hz / 300 Hz = 147
So the first 147 coefficients are taken and the DFT is applied to them. After that the logarithm is cancelled by applying the exponential function, and the square root of the expression is taken to get the spectral coefficients of the smoothed spectrum (contour).
With this contour it is possible to determine the maxima (the formants), of which the first two are the representative features I used.
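The smoothing chain described above (logarithm, inverse DFT, short pass liftering, DFT, exponential, square root, peak search) can be sketched in Python. This is only an illustration under stated assumptions and not the author's “spectrum_complete.m”: the small constant eps that avoids log(0), the choice to keep the mirrored coefficients as well so that the smoothed log spectrum stays real, and all function names are assumptions.

```python
import cmath
import math

def dft(x, inverse=False):
    """DFT with the 1/N factor on the forward transform, as in section 2.2.1."""
    n_len = len(x)
    sign = 1 if inverse else -1
    scale = 1.0 if inverse else 1.0 / n_len
    return [scale * sum(x[k] * cmath.exp(sign * 2j * cmath.pi * n * k / n_len)
                        for k in range(n_len))
            for n in range(n_len)]

def border_coefficient(sample_rate_hz, border_hz):
    """c_border ~ sample frequency / border frequency."""
    return round(sample_rate_hz / border_hz)

def smoothed_spectrum(frame, n_keep, eps=1e-12):
    """Contour of the magnitude spectrum via the real cepstrum."""
    log_power = [math.log(abs(v) ** 2 + eps) for v in dft(frame)]
    cepstrum = [c.real for c in dft(log_power, inverse=True)]
    n_len = len(cepstrum)
    # Short pass lifter: keep the first n_keep coefficients and their mirror images.
    liftered = [c if (k < n_keep or k > n_len - n_keep) else 0.0
                for k, c in enumerate(cepstrum)]
    smooth_log = [v.real for v in dft(liftered)]
    # Cancel the logarithm, then take the root to undo the squared magnitude.
    return [math.sqrt(math.exp(v)) for v in smooth_log]

def first_two_peaks(values):
    """Indices of the first two local maxima (formant candidates)."""
    peaks = [i for i in range(1, len(values) - 1)
             if values[i - 1] < values[i] >= values[i + 1]]
    return peaks[:2]

if __name__ == "__main__":
    print(border_coefficient(44100, 300))
    print(first_two_peaks([0, 1, 0, 2, 0, 3, 0]))
```

A peak index n found this way corresponds to the frequency n * 44100 Hz / N for a section of N samples. If every coefficient is kept, the chain simply returns the unsmoothed magnitude spectrum, which is a convenient sanity check.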
2.3. Optimisation of MLP
I chose a Multilayer Perceptron Network and used the Back Propagation Algorithm of Professor Yu Hen Hu to solve this classification problem.
The network consists of 98 inputs, which result from 49*2 formants for each sample, and 12 outputs corresponding to the number of different words.
At first I tried to design a two-word recognition system to get an idea of the size of the net. This resulted in a net with 3 hidden layers of 12 hidden neurons each, for which the algorithm converged with a classification rate of 100%.
I also used these values at the beginning for the 12-word classifier, and after many tries and tests I established that these dimensions give the best classification results.
Then I tried to optimise parameter after parameter to get a better result.
I began with the number of samples per epoch and fixed it at 8.
Afterwards I changed the number of epochs to 200000 and the number of epochs before the convergence test to 150.
After that, I experimented with the learning rate and the momentum and realized that it is much better to shrink both. So I finally used 0.01 for the learning rate and 0 for the momentum. This seems to be similar to choosing a smaller step size for the LMS-Algorithm to shrink the misadjustment; actually the Error Correcting Learning is a kind of LMS-Algorithm.
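To make the role of the two parameters concrete, here is a generic gradient step with momentum in Python. It is a textbook form shown under the assumption that the Back Propagation files use an update of this kind; it is not taken from “bp_all.m”, and the function name is hypothetical.

```python
def momentum_step(weights, grads, prev_delta, learning_rate=0.01, momentum=0.0):
    """One update: delta = -learning_rate * gradient + momentum * previous delta."""
    delta = [-learning_rate * g + momentum * d for g, d in zip(grads, prev_delta)]
    new_weights = [w + d for w, d in zip(weights, delta)]
    return new_weights, delta

if __name__ == "__main__":
    # With momentum 0 the previous step has no influence on the update.
    print(momentum_step([1.0, -2.0], [2.0, -4.0], [0.5, 0.5]))
```

With the values chosen in the report (learning rate 0.01, momentum 0) every update is a small plain gradient step, which is why the comparison with a small LMS step size is natural.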
Subsequently, I tried to find a better solution by changing the scaling range of my input data. I established that it is very useful to take a range of [-1,1] instead of [-5,5].
With these changes the classification rate reached 87% for training but only 65% for testing. That was considerable progress, but a testing rate of 65% isn’t useful for a speech recognition system, so I had to try to improve upon it.
The gap between training and testing seemed to be caused by the similarity of the words zero and seven, which the network had problems distinguishing. The word four also caused minor problems.
I tried to use a tuning set of 10% or 30% of the input data, but that didn’t bring any improvement.
That’s why I thought about designing a neighbour network (an expert network) which is responsible for the distinction between the words zero, four, and seven whenever the main network gives its maximum output value for one of these words. After I had connected both networks, the classification rate for testing increased considerably.
2.4.Program files
My programs can be tested by setting the path of “Matlab” to the respective folders and starting the demo files.
(Any overwritten necessary mat-files can be restored from the folder “Original_data”)
Folder 1_Data_Collection
o Shows the data collection by running “data_collection_demo.m”
Folder Speech_Samples (contains the text files of the samples)
data_collection_demo.m
getdata.m
maximum.m
scale_data.m
spectrum_complete.m
textfiles.m
window_functions.m
Folder 2_Training
o Folder Main_network
Shows the training process of the main network by running “training_main_demo.m”
actfun.m
actfunp.m
bp_all.m
bpconfig_all.m
bpdisplay.m
bptest.m
cvgtest.m
rand_files.m
randomize.m
rsample.m
scale.m
training_main_demo.m
input_scale.mat
o Folder Expert_network
Shows the training process of the expert network by running “training_expert_demo.m”
actfun.m
actfunp.m
bp_ex.m
bpconfig_ex.m
bpdisplay.m
bptest.m
cvgtest.m
data_choice.m
rand_files.m
randomize.m
rsample.m
scale.m
training_expert_demo.m
input_scale.mat
Folder 3_Recognition
o Shows the recognition of a word by running “recognition_demo.m”
actfun.m
getdata.m
maximum.m
recognition_demo.m
scale_data_2.m
spectrum_single.m
window_functions.m
weights_all.mat
weights_ex.mat
final_test_classification_rate.txt (shows the recognition results of all 600 used samples)
test.txt (appropriate sample “eight”)
3. Final Results
3.1.Final MLP
3.2.Parameter and Results
3.2.1.Main network
(distinguishes between all words)
inputs : 98
outputs : 12
hidden layers : 3
hidden neurons per hidden layer : 12
learning rate : 0.01
momentum : 0
samples per epoch : 8
number of epochs before convergence test : 150
number of epochs until best solution : 200000
classification rate after training : 87.6%
classification rate after testing : 71.7%
[Figure: final MLP. Labels: 98 inputs; main network with three layers of 12 hidden neurons and 12 outputs; expert network with 3 outputs; logical unit which combines outputs; final output.]
3.2.2.Expert network
(distinguishes between the words 0, 4 and 7)
inputs : 98
outputs : 3
hidden layers : 3
hidden neurons per hidden layer : 12
learning rate : 0.01
momentum : 0
samples per epoch : 8
number of epochs before convergence test : 150
number of epochs until convergence : 38000
classification rate after training : 100.0%
classification rate after testing : 86.7%
3.2.3.Combined network
The function “recognition_network” uses the best weights calculated during training.
The logical unit combines the outputs of both networks. If the output of the main network is maximum for 0, 4 or 7, then the expert network is consulted; this increases the total classification rate of the main network by about 20%.
That means the final classification rate after testing equals 90.83%.
You can apply the final function “recognition_demo” to any appropriate text file. The name of the text file has to be changed in the function.
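The decision rule of the logical unit can be sketched as follows. This is an illustrative Python version, not the author's “recognition_demo.m”: the order of the classes in the output vectors and all names are assumptions.

```python
# Assumed class order of the 12 main outputs and the 3 expert outputs.
WORDS = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "yes", "no"]
EXPERT_WORDS = ["0", "4", "7"]

def argmax(values):
    """Index of the largest output value."""
    return max(range(len(values)), key=values.__getitem__)

def combine(main_out, expert_out):
    """Take the main network's decision unless it is 0, 4 or 7; then ask the expert."""
    word = WORDS[argmax(main_out)]
    if word in EXPERT_WORDS:
        word = EXPERT_WORDS[argmax(expert_out)]
    return word

if __name__ == "__main__":
    main_out = [0.1] * 12
    main_out[7] = 0.9                          # main network votes for "7"
    print(combine(main_out, [0.8, 0.1, 0.1]))  # expert overrides the vote
```

In this form the expert network can only change a decision among the three confusable words; all other decisions of the main network pass through unchanged.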
4.Discussion
Using the main network alone I only reached a classification rate of about 70%, although I tried to optimise all parameters, the scaling and so on. That is not enough for a speech recognition system.
I think this is because of the input data: there are vectors that are too similar to each other, which causes overlapping of classes. So I decided to combine the main network with an expert network, which increased the classification rate to 90.83%.
The similarities of the vectors appear because I don’t use any time adjusting algorithms like non-linear mapping of all samples to a standard length. I didn’t want the computational cost to become too large. These are only short words, not longer than half a second, so the shifting between the tones could be kept rather small.
I trained the whole network using only my own speech samples. So the results for samples of other voices can differ considerably if the person speaks too fast or too slowly. The worst results are obtained when a female voice is tested, because its frequencies lie in a higher range than those of a male voice. I tested my algorithm on several persons and the results vary considerably.
The results could of course be improved by implementing another expert network (e.g. for the words 1, 9 and no). But instead of the 20% improvement of the classification rate gained with the first expert network, it would bring only about 1.5% improvement, which is not very effective in comparison to the increase in network size. But that depends on the size of the network that may be used.
In addition I realized that a single recognition of one word often takes too long. That’s probably because I used many loops in my program to implement my algorithm, but this is a point that was difficult for me to improve.
Altogether I am quite satisfied with these results, but individual aspects might still be improved to increase the performance.
Appendix:
References:
“Signal Analysis and Recognition”, Rüdiger Hoffmann
“Technical Speech Communication” (lecture notes), University of Technology Dresden, Institute for Acoustics and Speech Communication