Voice Enabled Phone Directory
May 4, 2004
Table of Contents:
I. Introduction
II. Statement of the Problem
III. Proposed Solution
IV. Bibliography
V. Appendix (Project Code & Documentation)
an4.log (generated log file from sphinx)
I. Introduction
Speech recognition is a topic that has long interested me, and I decided to focus on it
this semester. During the semester I read a range of journal articles and decided to
build an application on top of Automatic Speech Recognition (ASR).

As I read articles and books, I developed the idea of building a voice enabled phone
directory. The ultimate vision is to have a system that uses speech as its primary means
of communication. The user would call one number that connects to the phone database,
communicates with it using speech, and dials numbers for the user. Therefore, the user
would only need to know one number, which would enable him/her to call any name in the
database via speech.

This system is aimed at people who wish to have fast and mobile access to calling
others, but its main purpose is to serve people with certain disabilities who would
benefit from having such a system; a system that uses a ‘hands free’ structure.

Companies such as Advanced Recognition Technologies, Inc. (ART) and Microsoft, as well
as others, have been integrating speech recognition systems into their software. These
voice-command-based applications will be able to cover many of the communication aspects
of our daily lives, ranging from telephones to the Internet.

There are two types of speech recognition systems. The first is a ‘speaker dependent
system’ that is designed for a single speaker; it is easy to develop but not flexible to
use. The second is a ‘speaker independent system’ designed for any speaker. It is harder
to develop, less accurate and more expensive than the ‘speaker dependent’ system, but it
is more flexible.

The vocabulary size of an Automatic Speech Recognition (ASR) system ranges from a small
vocabulary of as few as two words to a very large vocabulary of tens of thousands of
words. The size of the vocabulary affects the complexity and the accuracy of the ASR
system.

A number of different factors can affect the accuracy and performance of an ASR system,
such as pronunciation and frequency, the speaker's current mood, age, sex, dialect and
inflections, and background noise. It is thus necessary for the system to overcome these
obstacles. As an example, the system could use filters to reduce some of these problems,
such as background noise, coughs and heavy breathing.

The processing of speech requires analog-to-digital conversion, in which the voice's
pressure waves are converted to numerical values at regular intervals so that they can
be processed digitally. When the values are replayed at the appropriate rate, the sound
is reproduced. There are multiple ways to save speech. As an example, you could use wav
files, which were initially defined by Microsoft for their multimedia extensions, and
which can store mono or stereo sound, commonly at sampling rates of up to 44.1 kHz.
Another type is the RAW type, which is the basic digital sound format. The data file is
a stream of bytes in which each group of bytes represents the amplitude of a single
sample. It does not contain a header, so in order to replay it correctly the sampling
rate must be known. The model that I am going to build will use these sound formats for
the Voice Enabled Phone Directory (VEPD).
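
The difference between the two formats can be sketched in a few lines. The example below is written in Python for illustration only (the project's own scripts are Perl), and the tone, its length and the 16000 Hz rate are assumptions chosen for the demonstration. It samples a sine wave at regular intervals, packs the amplitudes as headerless RAW bytes, and wraps the same bytes in a WAV header.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # Hz; an assumed rate for this demonstration


def make_tone(freq_hz=440.0, seconds=0.01):
    """Sample a sine wave at regular intervals and quantize it to 16-bit integers."""
    count = int(SAMPLE_RATE * seconds)
    return [int(32767 * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
            for n in range(count)]


def to_raw(samples):
    """RAW format: nothing but little-endian 16-bit amplitudes, no header."""
    return struct.pack("<%dh" % len(samples), *samples)


def to_wav(samples):
    """WAV format: the same amplitudes preceded by a header holding rate and width."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(to_raw(samples))
    return buffer.getvalue()


samples = make_tone()
raw_bytes = to_raw(samples)
wav_bytes = to_wav(samples)

# Reading the WAV back recovers the sampling rate from its header;
# the RAW bytes carry no such information, so the rate must be known separately.
with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
    rate_from_header = wav.getframerate()
    frames = wav.readframes(wav.getnframes())
```

The audio payload of the WAV file is byte-for-byte the RAW stream; only the header differs, which is why a RAW file cannot be replayed correctly without being told the sampling rate.
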
The Hidden Markov Model (HMM) is a Markov chain in which each state is associated with
output symbols, or with the probabilistic functions that describe them. To be specific,
it is defined by its graph structure, which is the number of states and their
connections, and by the number of mixtures per state.

The model consists of a set of nodes that are chosen to represent a particular
vocabulary. These nodes are ordered and connected from left to right, and recursive
loops are allowed. Recognition is based on a transition matrix that gives the
probability of changing from one node to another. A good way to understand the HMM is
through an example. If we build a model that recognizes only the word “yes”, then the
word is composed of the two phonemes ‘ye’ and ‘s’. This corresponds to the six states of
the two phoneme models. To be more accurate, “yes” is composed of ‘y’, ‘e’ and ‘s’. The
ASR system cannot know the acoustic state in the mind of the speaker, so it tries to
find W by searching for the most likely sequence of states and words W that could have
generated X. Here W represents the sequence of ‘words’ and X is the sequence of acoustic
sounds.
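
The search for the most likely state sequence can be illustrated with a small Viterbi decoder. This is a minimal sketch in Python, not the decoder used by any real ASR system: it uses one state per phoneme instead of several, and the observation symbols "a", "b", "c" and every probability in the tables are hypothetical values chosen for the example.

```python
import math

STATES = ["y", "e", "s"]

# Left-to-right transitions: each state may loop on itself or move to the next one.
TRANSITIONS = {
    "y": {"y": 0.5, "e": 0.5},
    "e": {"e": 0.5, "s": 0.5},
    "s": {"s": 1.0},
}

# Probability of observing each (hypothetical) acoustic symbol in each state.
EMISSIONS = {
    "y": {"a": 0.8, "b": 0.1, "c": 0.1},
    "e": {"a": 0.1, "b": 0.8, "c": 0.1},
    "s": {"a": 0.1, "b": 0.1, "c": 0.8},
}


def viterbi(observations):
    """Return the most likely state sequence, assuming the word starts in state 'y'."""
    first = STATES[0]
    # best[state] = (log probability of the best path ending in state, that path)
    best = {first: (math.log(EMISSIONS[first][observations[0]]), [first])}
    for symbol in observations[1:]:
        step = {}
        for prev, (score, path) in best.items():
            for nxt, p_trans in TRANSITIONS[prev].items():
                candidate = score + math.log(p_trans) + math.log(EMISSIONS[nxt][symbol])
                if nxt not in step or candidate > step[nxt][0]:
                    step[nxt] = (candidate, path + [nxt])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


decoded = viterbi(["a", "a", "b", "c", "c"])
print(decoded)
```

Working in log probabilities avoids numeric underflow on long observation sequences, and the self-loops are what let one phoneme stretch over several acoustic frames.
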
The HMM is often referred to as a parametric model because the state of the system at
each time t is completely described by a finite set of parameters. The training
algorithm estimates the HMM parameters by taking a first good guess using the
preprocessed speech data (features) with their associated phoneme labels. The HMM
parameters are stored as files and then retrieved by the training procedure.

Model training is performed by estimating the HMM parameters; estimation accuracy is
roughly proportional to the amount of training data. The HMM is well suited for a
speaker-independent system because the speech used during training is captured as
probabilities or generalizations, which makes it a good system to use for multiple
speakers.

It is worth noting the difference between an isolated system and a continuous system.
An isolated system takes a single unit, either a full word or a letter, at a time. It is
the simplest type because it is easy to find the ending points of a word thanks to the
pauses between words or letters. The second type, the continuous system, uses full
sentences, and therefore it is much harder to find the starting and ending points.
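
To see why pauses make isolated words easy to delimit, consider a simple energy threshold. The sketch below is in Python for illustration, and the frame size, threshold and synthetic samples are all assumptions; it marks a frame as speech when its mean absolute amplitude exceeds the threshold and reports the first and last such frames. It is not the endpointing method of any particular recognizer.

```python
def find_endpoints(samples, frame_size=160, threshold=500.0):
    """Return (start, end) sample indices of the speech region, or None if silent.

    A frame counts as speech when its mean absolute amplitude exceeds threshold.
    """
    voiced_starts = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(abs(value) for value in frame) / len(frame)
        if energy > threshold:
            voiced_starts.append(start)
    if not voiced_starts:
        return None
    end = min(voiced_starts[-1] + frame_size, len(samples))
    return voiced_starts[0], end


# Synthetic signal: silence, then a loud stretch standing in for a word, then silence.
signal = [0] * 320 + [1000] * 480 + [0] * 320
endpoints = find_endpoints(signal)
```

In continuous speech the words run together without silent frames between them, so a threshold like this finds the start and end of the whole sentence but not the boundaries inside it.
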
II. Statement of the Problem
The focus of my project is an automatic, speech-driven phone directory assistance
system. It is hard to develop a whole system that uses a ‘hands free’ environment
because there are many areas to cover. As I said, the ultimate vision is to have a
‘hands free’ system. I want to build a structure or module that could later be enhanced
to support a ‘hands free’ voice enabled phone directory.
III. Proposed Solution
My solution consists of three parts. I will go through them and explain my approach and
what I would like to obtain from each, and then demonstrate how they all play a part in
the final configuration.

[Figure: overview of the models]

The following paragraphs describe each part.

The first part needed is an ASR system that I can work with in order to build my speech
enabled phone directory. I need a speaker-independent system based on HMM that has a
large vocabulary. After researching the matter, I decided to use sphinx, developed at
Carnegie Mellon University, as my ASR system. In sphinx, the basic sounds in the
language are classified into phonemes or phones. The phones are distinguished according
to their position within the word (beginning, end, internal, or single) and they are
further refined into context-dependent triphones. The acoustic models are built from
these triphones. Triphones are modeled by HMMs and usually contain three to five states.
The HMM states are clustered into a much smaller number of groups called senones.

The input audio consists of 16-bit samples, sampled at 8 to 16 kHz, stored in .raw
format. Training requires good data consisting of spoken text or utterances. Each
utterance:
- is converted into a linear sequence of triphone HMMs using the pronunciation
  dictionary;
- is used to find the best state sequence, or state alignment, through the HMMs.
For each senone, all of the frames assigned to it in the training data are gathered and
used to build suitable statistical models.

The language model consists of:
- Unigrams: the entire set of words and their individual probabilities of occurrence in
  the language.
- Bigrams: the conditional probability that word 2 immediately follows word 1 in the
  language. The model contains this information for some subset of the possible word
  pairs.
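
The unigram and bigram probabilities described above can be estimated by counting. The following is a minimal sketch in Python, assuming a tiny hypothetical training set of command sentences; it uses plain relative frequencies with no smoothing, which is a simplification and not necessarily how the sphinx language model files are built.

```python
from collections import Counter


def train_language_model(sentences):
    """Estimate unigram and bigram probabilities by relative frequency."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))

    total_words = sum(unigram_counts.values())
    unigrams = {word: count / total_words for word, count in unigram_counts.items()}

    # P(word2 | word1) = count(word1 word2) / count(word1 followed by anything)
    first_word_counts = Counter()
    for (word1, _), count in bigram_counts.items():
        first_word_counts[word1] += count
    bigrams = {(word1, word2): count / first_word_counts[word1]
               for (word1, word2), count in bigram_counts.items()}
    return unigrams, bigrams


# Hypothetical training sentences, invented for this example.
SENTENCES = ["CALL SAM SMITH", "CALL GEORGE ADAMS", "CALL SAM"]
unigrams, bigrams = train_language_model(SENTENCES)
```

Only word pairs that actually occur in the training text receive an entry, which mirrors the point that the bigram table covers a subset of the possible pairs.
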
It also contains the lexicon structure, which is the pronunciation dictionary: a file
that specifies word pronunciations. Pronunciations are specified as linear sequences of
phones. It is essential to know that there can be multiple pronunciations for the same
word or letter. The dictionary also includes a silence symbol <sil> to represent the
user’s silence. As an example, ‘ZERO’ is pronounced ‘Z IH R OW’.
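
A pronunciation dictionary of this kind is straightforward to load into a lookup table. The sketch below is in Python for illustration. The file layout it assumes (one entry per line, the word first and then its phones, with a "(2)" suffix marking an alternate pronunciation) is an assumption about the format, and the second ZERO entry and the SIL phone are hypothetical examples.

```python
import re
from collections import defaultdict

# Hypothetical dictionary text; only the first ZERO entry comes from the report.
DICTIONARY_TEXT = """\
ZERO Z IH R OW
ZERO(2) Z IY R OW
<sil> SIL
"""


def parse_dictionary(text):
    """Map each word to a list of pronunciations, each a list of phones."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word = re.sub(r"\(\d+\)$", "", fields[0])  # drop the alternate marker
        lexicon[word].append(fields[1:])
    return dict(lexicon)


lexicon = parse_dictionary(DICTIONARY_TEXT)
```

Storing a list of pronunciations per word keeps the linear phone sequences intact while allowing several variants of the same word.
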
The second step in the process was building a database that includes the contact
information of the people in the directory. I decided to use PostgreSQL for this part
because I had the book and installation CD and was familiar with its contents.

The database, named ADB, contains a “People” entity, which has these attributes:
- pid: the unique identification for each person.
- first_name: the first name of a person.
- last_name: the last name of a person.
- phone_num: the phone number, of type varchar(12) UNIQUE (which means that the system
  will not accept the same number more than once).
- city: the city name, of type varchar(15).
The primary key is (pid, first_name, last_name).

Here is an example of what the database contains:

 pid | first_name | last_name | phone_num | city
-----+------------+-----------+-----------+----------
   1 | Sam        | Smith     | 765-2743  | Ramallah
   2 | George     | Adams     | 765-2741  | Richmond

ADB holds each person’s information and provides the data needed for the phone
directory. In other words, it acts as an address book, but it can also select the
information needed by the application. For example, you can either select all names in
the directory, or you can select a specific person by first name or last name. The
Application part covers db.pm, people.pm and people.pl, which are the scripts that
connect to ADB and send and retrieve information through it.
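
The table and the lookups it supports can be sketched as follows. This example uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server, which is a substitution and not the project's setup. The column types for pid, first_name and last_name are assumptions, and the function names are hypothetical.

```python
import sqlite3


def create_directory():
    """Create an in-memory stand-in for ADB with the two sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE people (
            pid        INTEGER,              -- assumed type
            first_name VARCHAR(15),          -- assumed type
            last_name  VARCHAR(15),          -- assumed type
            phone_num  VARCHAR(12) UNIQUE,
            city       VARCHAR(15),
            PRIMARY KEY (pid, first_name, last_name)
        )
        """
    )
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Sam", "Smith", "765-2743", "Ramallah"),
            (2, "George", "Adams", "765-2741", "Richmond"),
        ],
    )
    return conn


def find_by_first_name(conn, first_name):
    """Select name and number for everyone with the given first name."""
    cursor = conn.execute(
        "SELECT first_name, last_name, phone_num FROM people WHERE first_name = ?",
        (first_name,),
    )
    return cursor.fetchall()


def list_all_names(conn):
    """Select every name in the directory, ordered by pid."""
    cursor = conn.execute("SELECT first_name, last_name FROM people ORDER BY pid")
    return cursor.fetchall()


conn = create_directory()
sam_rows = find_by_first_name(conn, "Sam")
all_names = list_all_names(conn)
```

Because phone_num is declared UNIQUE, inserting a third person with the number 765-2743 would be rejected by the database, which is the behaviour described for that attribute.
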
The application is the third item in the deliverables and it serves as the main
connector between the ASR system (sphinx) and the database (ADB). It provides easy
communication between sphinx and the database for sending and receiving information.
The programming language used is Perl, along with shell commands embedded in the Perl
code.

[Figure: overall structure of the application]

The overall structure consists of four main stages, which are recording speech, decoding
speech, connecting to the database and finally displaying the results back to the user.

The main script is the Voice Enabled Phone Directory (VEPD.pm and VEPD.pl), which calls
one script at a time, each with its own duty, and then moves on to the next script. The
best way to go through the architecture is to follow it step by step while explaining
what each script does.

The first script called is the record script (record_wav.pl). This script has a simple
objective: when it is called, it displays instructions telling the user how to record.
To start recording the user presses the space bar, and to stop the user presses the
space bar again. A system call is made to record during that time. It takes a rate of
16000 and an output option -o, and the recording is saved as record000, then record001
... recordNNN in a directory called wav_files. Although recordings can go up to
recordNNN, we only want to deal with two files at a time, because that keeps the
structure contained. The files are of the wav type mentioned in the introduction
section.

The next script called is wav_to_raw.pl. This script runs through all the wav files,
converts each one to a raw file and places the result in another directory. First it
opens the directory that contains the wav files recorded by record_wav.pl. For each file
name that matches .wav, the extension is changed to .raw. Then a system call uses sox
(a sound file exchange program), with a rate of 16000, to convert the wav file and write
the result into another directory called raw_files. There is an option that could be
used later as an enhancement: replaying what the user recorded in the record_wav.pl
script. For this, the new raw file would be converted back into a wav file so that it
can be replayed to the user when the option is requested.
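
The renaming and conversion step can be sketched as a function that builds one sox command per recording. This is an illustration in Python, not the original wav_to_raw.pl; the function name is hypothetical, and the commands are only constructed here, not executed, so the sketch runs without sox installed. The -r option sets the sample rate and -t raw selects headerless output.

```python
import os


def build_sox_commands(file_names, wav_dir="wav_files", raw_dir="raw_files", rate=16000):
    """Return one sox argument list per .wav file, writing .raw files to raw_dir."""
    commands = []
    for name in sorted(file_names):
        base, extension = os.path.splitext(name)
        if extension.lower() != ".wav":
            continue  # ignore anything that is not a wav recording
        source = os.path.join(wav_dir, name)
        target = os.path.join(raw_dir, base + ".raw")
        commands.append(["sox", source, "-r", str(rate), "-t", "raw", target])
    return commands


commands = build_sox_commands(["record001.wav", "notes.txt", "record000.wav"])
for command in commands:
    print(" ".join(command))
```

To actually perform the conversion, each argument list could be passed to subprocess.run(command, check=True), with os.listdir("wav_files") supplying the file names.
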
After we get the raw sound file, we can call the get_speech.pl script. The raw file used
will always be record000. Even if the person wants to add a string, the second
recording, record001, is concatenated onto record000, so record000 in turn includes the
first spoken string plus the second one. Once record001 has been concatenated onto
record000, record001 is removed, so that if the user wants to search further or add
another string it is saved as record001 again and the process is repeated. Next we open
the directory in which the ASR system (sphinx) is located, and we put the correct raw
sound file in this location so that sphinx can try to decode what was said. The next
step is to define the locations that the ASR needs: S3BATCH (a location variable) is the
sphinx application we are running, and the locations of the other resources that must be
passed as arguments to S3BATCH are defined as well. A system call then runs the program
and decodes the speech. A log file is generated that shows all the commands, what
occurred, and how the system (sphinx) arrived at the decoded speech; it shows the whole
process.

Finally, a system call will ‘grep’ (or get) the line of decoded text and put it in a
variable, which is in turn written to another file, DecodedSpeech.txt. This way other
scripts are able to use the generated text.
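
Because raw files have no header, the concatenation of record001 onto record000 amounts to appending bytes. The sketch below shows that step in Python for illustration; the function name, the .raw file extensions and the sample bytes are assumptions, and a temporary directory stands in for the raw files directory.

```python
import os
import tempfile


def append_recording(raw_dir, first="record000.raw", second="record001.raw"):
    """Append the second raw recording to the first, then delete the second.

    Returns True if something was appended, False if there was no second file.
    """
    first_path = os.path.join(raw_dir, first)
    second_path = os.path.join(raw_dir, second)
    if not os.path.exists(second_path):
        return False
    with open(second_path, "rb") as source, open(first_path, "ab") as target:
        target.write(source.read())
    os.remove(second_path)  # frees the name record001 for the next recording
    return True


# Demonstration with two tiny stand-in recordings.
with tempfile.TemporaryDirectory() as raw_dir:
    with open(os.path.join(raw_dir, "record000.raw"), "wb") as handle:
        handle.write(b"\x01\x00\x02\x00")
    with open(os.path.join(raw_dir, "record001.raw"), "wb") as handle:
        handle.write(b"\x03\x00")

    merged = append_recording(raw_dir)
    with open(os.path.join(raw_dir, "record000.raw"), "rb") as handle:
        combined = handle.read()
    leftover = os.path.exists(os.path.join(raw_dir, "record001.raw"))
    merged_again = append_recording(raw_dir)
```

The same approach would not work for wav files, since a plain append would leave the header describing only the first recording.
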
Once we have the decoded text, we call DisplaySpeech.pl, which simply calls a single
function. First, the function finds out whether the user wants to search by first name
(getFirstChar) or by last name (getLastChar). Before this script is run, the main menu
function in VEPD.pm asks the user whether to search by first or last name; here we open
the file that holds the answer and store the result in a variable. Then we run the
record script through a function in order to record wav files, followed by the wav to
raw script, which converts the wav files into raw files and places them in the raw files
directory.

Next we get the speech by running get_speech.pl, and we open the decoded speech file
taken from the ASR (sphinx) log file; match and split commands are used to strip
unwanted text and return the decoded speech as a text string. There is an option called
Play_wav() that lets the user hear what he/she said, if the user chooses to. With the
decoded text, the script connects to ADB to get back names and numbers. The connection
to the database goes through db.pm, and the information is retrieved through people.pm
and people.pl.

The main function of the db.pm script is to connect to the PostgreSQL database called
ADB. It contains functions that prepare SQL statements and execute them. It then either
fetches rows and puts them in an array of rows, or does the reverse and inserts an array
of rows into the database ADB.

The people.pm/pl scripts use db.pm to connect to the database and run the SQL statements
needed to get back the results. As an example, if you want to search by first name, then
we are selecting first_name, last_name, phone_num from people where first_name matches
any of the strings to be run through ADB. We use SQLselect from db.pm to connect to ADB
and run the SQL statement. The results are stored in an array, rows, and a status
counter is returned. If the status is 0, there is no result. If the status is 1, there
is exactly one match, and in that case the user is asked whether or not to call that
name. If the status is more than one, the user can either enter another string or pick
from the list, using a couple of options that are shown. In the people.pl file, the
entry point determines which function is run: get by first name, get by last name, etc.
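
The three cases for the status counter can be written as a small function. This is a sketch in Python of the branching described above, not the code in people.pm; the function name and the wording of the prompts are hypothetical.

```python
def describe_matches(rows):
    """Turn the rows returned by a name search into a prompt for the user.

    Each row is expected to be a (first_name, last_name, phone_num) tuple.
    """
    status = len(rows)
    if status == 0:
        return "No match found."
    if status == 1:
        first, last, phone = rows[0]
        return "Call %s %s at %s? (y/n)" % (first, last, phone)
    lines = ["%d matches. Say another string or pick a number:" % status]
    for index, (first, last, phone) in enumerate(rows, start=1):
        lines.append("%d. %s %s %s" % (index, first, last, phone))
    return "\n".join(lines)


single = describe_matches([("Sam", "Smith", "765-2743")])
several = describe_matches([
    ("Sam", "Smith", "765-2743"),
    ("George", "Adams", "765-2741"),
])
```

Keeping the decision in one place makes it easy to add further options later, such as narrowing the list by city.
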
The VEPD.pm/pl scripts contain functions to execute the other scripts. As an example,
the main menu is called in VEPD.pl and, depending on what the user needs, a function is
called to retrieve, add, view or quit the program.
(* Note: refer to the appendix for the scripts, their functionality and documentation)