AUTOMATIC SPEECH RECOGNITION:
HUMAN COMPUTER INTERFACE FOR
KINYARWANDA LANGUAGE
Muhirwe Jackson
BSc (Mak)
A Project Report Submitted in Partial Fulfilment of the
Requirements for the Award of the Degree
Master of Science in Computer Science of
Makerere University
August 2005
Title page
i
DECLARATION
I, Muhirwe Jackson, do hereby declare that this project report is my original work and has
never been submitted for any award of a degree in any institution of higher learning.
Signed:..........................................................Date:...........................................
Muhirwe Jackson,
Candidate.
ii
APPROVAL
This report has been submitted for examination with my approval as supervisor.
Signed:..........................................................Date:...........................................
Dr. Jehopio Peter, Ph.D.
Supervisor.
iii
DEDICATION
To the Prince of Peace, my Lord and Savior Jesus Christ:
Let it be said of me that my source of strength is Christ alone.
To my wife, Yvonne Muhirwe,
who has greatly encouraged and supported me during my studies.
To my children,
who always bring joy to my life.
To my mum, Ms Mukankuliza Joyce,
who has wonderfully supported and encouraged me throughout my education: there is no
mother like you.
To my brothers and sister:
I love you all.
iv
"I can do all things through Christ
which strengtheneth me." Phil 4:13
KJV
v
ACKNOWLEDGEMENT
Success in life is never attained single-handedly. I would like to express my heartfelt
gratitude to my God Almighty, who revealed Himself to me through the Holy Spirit and has
since been my source of strength and wisdom. I wish to extend thanks to my supervisor, Dr.
Peter Jehopio, for the professional guidance that has enabled me to accomplish this research.
I also wish to extend my sincere thanks to the Dean of the Faculty of Computing and
Information Technology, Dr. Baryamureeba Venansius, for all the support he has provided to
me, both morally and financially, without which this project may not have been a success.
I would like to appreciate my wife, Yvonne Muhirwe, for being such a wonderful, loving and
understanding wife. Thanks for giving me the space and time to dedicate to my studies; my
success is your success.
I extend my thanks and appreciation to the Rector of the Kigali Institute of Education, Mr.
Mudidi Emmanuel, for having faith in me and for all the support he provided to me at the
beginning of the course. I extend my thanks and appreciation to the Rwandan Government,
through the Student Financing Agency for Rwanda (SFAR), for sponsoring me for the entire
course.
My sincere appreciation goes to the Makerere University Faculty of Computing and Information
Technology staff, especially Paul Bagenda and Kanagwa Ben, for their technical support.
Last but not least, I acknowledge all my lecturers and all my classmates on the computer
science programmes for having made my academic and social life comfortable at Makerere
University.
MAY GOD BLESS YOU ABUNDANTLY
vi
LIST OF
ACRONYMS/ABBREVIATIONS
LVCSR Large Vocabulary Continuous Speech Recognition
ASR Automatic Speech Recognition
TTS Text-to-Speech
IVR Interactive Voice Response
HCI Human Computer Interaction
I/O Input and Output
SU Speech Understanding
GUI Graphical User Interface
DVI Direct Voice Input
HMM Hidden Markov Models
HTK Hidden Markov Model Toolkit
BNF Backus-Naur Form
SLF Standard Lattice Format
MLF Master Label Files
MFCC Mel Frequency Cepstral Coefficients
vii
Contents
TITLE PAGE......................................i
DECLARATION....................................ii
APPROVAL.......................................iii
DEDICATION.....................................iv
ACKNOWLEDGEMENT...............................vi
LIST OF ACRONYMS/ABBREVIATIONS.....................vii
LIST OF FIGURES..................................xi
ABSTRACT......................................xii
1 INTRODUCTION 1
1.1 Background to the Study.............................1
1.2 Statement of the Problem............................3
1.3 Objectives of the Study..............................3
1.3.1 General Objective.............................3
1.3.2 Specific Objectives............................3
1.4 Scope.......................................4
1.5 Significance of the Study.............................4
2 Literature Review 6
2.1 Current State of ASR Technology and its Implications for Design......6
2.2 Types of ASR...................................8
2.3 Speech Recognition Techniques.........................9
2.4 Matching Techniques...............................10
viii
2.5 Corpora......................................11
2.6 Problems in Designing Speech Recognition Systems..............11
2.7 Similar Projects Carried out...........................12
3 METHODOLOGY 14
3.1 Data Preparation.................................16
3.1.1 The Task Grammar............................16
3.1.2 A Pronunciation Dictionary.......................18
3.1.3 Recording.................................19
3.1.4 Phonetic Transcription..........................21
3.1.5 Encoding the Data............................22
3.2 Parameter Estimation (Training)........................23
3.2.1 Training Strategies............................23
3.2.2 HMM Definition..............................25
3.2.3 HMM Training..............................27
3.2.4 Training..................................29
3.3 Recognition....................................29
3.4 Running the Recognizer Live...........................31
4 RESULTS 32
4.1 Performance Test.................................32
4.2 Performance Analysis...............................32
4.3 Testing the System on Live Data........................33
5 DISCUSSION,CONCLUSION AND RECOMMENDATIONS 35
5.1 Discussion.....................................35
5.2 Conclusion.....................................37
5.3 Areas for Further Study.............................37
REFERENCES 39
ix
APPENDICES 43
Appendix A:Word Network..............................43
Appendix B:Training Sentences............................45
Appendix C:Master Label Files............................47
Appendix D:Training Data..............................62
Appendix E:HMM Definitions............................67
Appendix F:VarFloor1.................................81
Appendix G:Recognition Output...........................82
Appendix H:Testing Data...............................86
x
List of Figures
3.1 Components of an ASR system.........................15
3.2 Grammar for voice dialling............................17
3.3 Process of creating a word lattice........................17
3.4 Recording and labelling data using hslab....................20
3.5 Training HMMs..................................24
3.6 Training isolated whole word models......................25
3.7 HMM training process..............................27
3.8 Speech recognition process............................30
4.1 Speech recognition results............................33
4.2 Live data recognition results...........................34
xi
ABSTRACT
The main purpose of the study was to develop an automatic speech recogniser for Kinyarwanda
language. The products of the study include an automatic phone dialling speech corpus, a
Kinyarwanda digit speech recogniser, and a recipe for building HMM speech recognisers,
especially for Kinyarwanda language.
Two corpora of audio recordings of indigenous Kinyarwanda language speakers, in which
subjects read aloud numeric digits, were collected. One of the collected corpora contained
the training data and the other the testing data.
The system was implemented using the HMM toolkit HTK by training HMMs of the words making up
the vocabulary on the training data. The trained system was tested on data other than the
training data, and the results revealed that 94.47% of the tested data were correctly
recognized.
The developed system can be used by developers and researchers interested in speech
recognition for Kinyarwanda language and any other related African language. The findings
of the study can be generalized to cater for large vocabularies and for continuous speech
recognition.
xii
Chapter 1
INTRODUCTION
1.1 Background to the Study
Speech is one of the oldest and most natural means of information exchange between human
beings. We as humans speak and listen to each other in the human-human interface. For
centuries people have tried to develop machines that can understand and produce speech as
humans do so naturally (Pinker, 1994 [20]; Deshmukh et al., 1999 [5]). Obviously such an
interface would yield great benefits (Kandasamy, 1995) [12]. Attempts have been made to
develop vocally interactive computers to realise voice/speech recognition. In this case a
computer can recognize text and give out a speech output (Kandasamy, 1995) [12].
Voice/speech recognition is a field of computer science that deals with designing computer
systems that recognize spoken words. It is a technology that allows a computer to identify
the words that a person speaks into a microphone or telephone.
Speech recognition can be defined as the process of converting an acoustic signal, captured
by a microphone or a telephone, to a set of words (Zue et al., 1996 [36]; Mengjie, 2001 [17]).
Automatic speech recognition (ASR) is one of the fastest developing fields in the framework
of speech science and engineering. As a new generation of computing technology, it comes as
the next major innovation in man-machine interaction after text-to-speech (TTS), supporting
interactive voice response (IVR) systems.
The first attempts (during the 1950s) to develop techniques in ASR, which were based on the
direct conversion of the speech signal into a sequence of phoneme-like units, failed. The
first positive results in spoken word recognition came into existence in the 1970s, when
general pattern matching techniques were introduced. As the extension of their applications
was limited, the statistical approach to ASR started to be investigated in the same period.
Nowadays, statistical techniques prevail in ASR applications. Common speech recognition
systems these days can recognize thousands of words. The last decade has witnessed dramatic
improvement in speech recognition technology, to the extent that high performance algorithms
and systems are becoming available. In some cases, the transition from laboratory
demonstration to commercial deployment has already begun (Zue et al., 1996) [36]. The reason
for this evolution of ASR is that it has many applications in many aspects of our daily
life, for example telephone applications, applications for the physically handicapped and
illiterate, and many others in the area of computer science. Speech recognition is
considered as an input as well as an output during Human Computer Interaction (HCI) design.
HCI involves the design, implementation and evaluation of interactive systems in the context
of the users' tasks and work (Dix et al., 1998) [6].
The list of applications of automatic speech recognition is long and growing; some of the
known applications include virtual reality, multimedia searches, auto-attendants, travel
information and reservation, translators, natural language understanding and many more
(Scansoft, 2004 [27]; Robertson, 1998 [24]).
Speech technology is the technology of today and tomorrow, with a growing number of methods
and tools for better implementation. Speech recognition has a number of practical
implementations for both fun and serious work. Automatic speech recognition has an
interesting and useful implementation in expert systems, a technology whereby computers can
act as a substitute for a human expert. An intelligent computer that acts, responds or
thinks like a human being can be equipped with an automatic speech recognition module that
enables it to process spoken information. Medical diagnostic systems, for example, can
diagnose a patient by asking him a set of questions, the patient responding with answers,
and the system responding with what might be a possible disease.
2
1.2 Statement of the Problem
As the use of ICT tools, especially the computer, is becoming inevitable, there are many
Rwandans who are left out due to inadequate human computer interface (HCI) design
considerations. A case in point is the many Rwandans who are left out due to the language
barrier (Earth Trends, 2003) [8]. These people can only read and write in their
mother-tongue, Kinyarwanda, making it impossible for them to use conventional ICT tools that
are built in the two international languages used in Rwanda, English and French.
The purpose of this project was therefore to design and train a speech recognition system
that could be used by application developers to develop applications that will take
indigenous Kinyarwanda language speakers aboard the current information and communication
technologies and fast-track the benefits of ICT.
1.3 Objectives of the Study
1.3.1 General Objective
The general objective of the project was to develop an automatic speech recogniser for
Kinyarwanda language.
1.3.2 Specific Objectives
The specific objectives of the project are:
i. To critically review literature related to ASR.
ii. To identify speech corpus elements exhibited in African languages such as Kinyarwanda.
iii. To build a Kinyarwanda language speech corpus for a voice operated telephone system.
iv. To implement an isolated whole word speech recognizer that is capable of recognizing
and responding to speech.
v. To train the above developed system in order to make it speaker independent.
vi. To validate the automatic speech recognizer developed during the study.
1.4 Scope
The project was limited to isolated whole words, and was trained and tested on only one-word
sentences, consisting of the numeric digits 0 to 9, that could be used to operate a voice
operated telephone system.
Human speech is inherently a multi-modal process that involves the analysis of the uttered
acoustic signal and includes higher level knowledge sources such as grammar, semantics and
pragmatics (Dupont, 2000) [7]. This research focused only on the acoustic signal processing,
ignoring the visual input.
1.5 Significance of the Study
The proposed research has theoretical, practical, and methodological significance:
i. The speech corpus developed will be very useful to any researcher who may wish to venture
into Kinyarwanda language automatic speech recognition.
ii. By developing and training a speech recognition system in Kinyarwanda language, the
semi-illiterate would be able to use it in accessing IT tools. This would help bridge the
digital divide, since Rwanda is a monolingual nation with a population of about 8 million
(Earth Trends, 2003) [8], all speaking Kinyarwanda.
iii. Since speech technology is the technology of today and tomorrow, the results of this
research will help many indigenous Kinyarwanda language speakers who are scattered all over
the Great Lakes region to take advantage of the many benefits of ICT.
iv. The technology will find applicability in systems such as banking, telecommunications,
transport, Internet portals, PC access, emailing, administrative and public services,
cultural centres and many others.
v. The built system will be very useful to computer manufacturers and software developers,
as they will have a speech recognition engine with which to include Kinyarwanda language in
their applications.
vi. By developing and training a speech recognition system in Kinyarwanda language, it would
mark the first step towards making ICT tools more usable by the blind and elderly people
with seeing disabilities.
5
Chapter 2
Literature Review
Human computer interaction, as defined in the background, is concerned with the ways users
(humans) interact with computers. Some users can interact with the computer using the
traditional methods of a keyboard and mouse as the main input devices and the monitor as the
main output device. For one reason or another, some users cannot interact with machines
using a mouse and keyboard (Rudnicky et al., 1993) [26], hence the need for special devices.
Speech recognition systems help users who in one way or another cannot use the traditional
input and output (I/O) devices. For about four decades human beings have been dreaming of an
"intelligent machine" which can master natural speech (Picheny, 2002) [19]. In its simplest
form, this machine should consist of two subsystems, namely automatic speech recognition
(ASR) and speech understanding (SU) (Reddy, 1976) [23]. The goal of ASR is to transcribe
natural speech, while that of SU is to understand the meaning of the transcription.
Recognizing and understanding a spoken sentence is obviously a knowledge-intensive process,
which must take into account all variable information about the speech communication
process, from acoustics to semantics and pragmatics.
2.1 Current State of ASR Technology and its Implications for Design
The design of user interfaces for speech-based applications is dominated by the underlying
ASR technology. More often than not, design decisions are based more on the kind of
recognition the technology can support than on the best dialogue for the user (Mane et al.,
1996) [16]. The type of design will depend, broadly, on the answer to this question: what
type of speech input can the system handle, and when can it handle it? When isolated words
are all the recognizer can handle, then the success of the application will depend on the
ability of designers to construct dialogues that lead the user to respond using single
words. Word spotting and the ability to support more complex grammars open up additional
flexibility in the design, but can make the design more difficult by allowing a more diverse
set of responses from the user. Some current systems allow a limited form of natural
language input, but only within a very specific domain at any particular point in the
interaction. Even in these cases, the prompts must constrain the natural language within
acceptable bounds. No systems allow unconstrained natural language interaction, and it is
important to note that most human-human transactions over the phone do not permit
unconstrained natural language either. Typically, a customer service representative will
structure the conversation by asking a series of questions.
With "barge-in" (also called "cut-through") (Mane et al., 1996) [16], a caller can interrupt
prompts and the system will still be able to process the speech, although recognition
performance will generally be lower. This obviously has a dramatic influence on the prompt
design, because when barge-in is available it is possible to write longer, more informative
prompts and let experienced users barge in. Interruptions are very common in human-human
conversations, and in many applications designers have found that without barge-in people
often have problems. There are a variety of situations, however, in which it may not be
possible to implement barge-in. In these cases, it is still usually possible to implement
successful applications, but particular care must be taken in the dialogue design and error
messages. Another situation in which technology influences design involves error recovery.
It is especially frustrating when a system makes the same mistake twice, but when the active
vocabulary can be updated dynamically, recognizer choices that have not been confirmed can
be eliminated, and the recognizer will never make the same mistake twice. Also, when more
than one choice is available (this is not always the case, as some recognizers return only
the top choice), then after the top choice is disconfirmed, the second choice can be
presented.
7
2.2 Types of ASR
ASR products have existed in the marketplace since the 1970s. However, early systems were
expensive hardware devices that could only recognize a few isolated words (i.e. words with
pauses between them), and needed to be trained by users repeating each of the vocabulary
words several times. The 1980s and 90s witnessed a substantial improvement in ASR algorithms
and products, and the technology developed to the point where, in the late 1990s, software
for desktop dictation became available 'off-the-shelf' for only a few tens of dollars. From
a technological perspective it is possible to distinguish between two broad types of ASR:
'direct voice input' (DVI) and 'large vocabulary continuous speech recognition' (LVCSR). DVI
devices are primarily aimed at voice command-and-control, whereas LVCSR systems are used for
form filling or voice-based document creation. In both cases the underlying technology is
more or less the same. DVI systems are typically configured for small to medium sized
vocabularies (up to several thousand words) and might employ word or phrase spotting
techniques. Also, DVI systems are usually required to respond immediately to a voice
command. LVCSR systems involve vocabularies of perhaps hundreds of thousands of words, and
are typically configured to transcribe continuous speech. Also, LVCSR need not be performed
in real time; for example, at least one vendor has offered a telephone-based dictation
service in which the transcribed document is e-mailed back to the user.
Specific examples of applications of ASR include, but are not limited to, the following:
i. Large vocabulary dictation - for RSI sufferers and quadriplegics, and for formal document
preparation in legal or medical services.
ii. Interactive voice response - for callers who do not have tone pads, for the automation
of call centers, and for access to information services such as stock market quotes.
iii. Telecom assistants - for repertory dialling and personal management systems.
iv. Process and factory management - for stocktaking, measurement and quality control.
8
2.3 Speech Recognition Techniques
Speech recognition techniques are the following:
i. Template-based matching approaches (Rabiner et al., 1979) [22]: unknown speech is
compared against a set of pre-recorded words (templates) in order to find the best match.
This has the advantage of using perfectly accurate word models. But it also has the
disadvantage that pre-recorded templates are fixed, so variations in speech can only be
modelled by using many templates per word, which eventually becomes impractical. Dynamic
time warping is a typical such approach (Tolba et al., 2001) [31]. In this approach, the
templates usually consist of representative sequences of feature vectors for the
corresponding words. The basic idea here is to align the utterance to each of the template
words and then select the word or word sequence that gives the best match. For each
utterance, the distances between the template and the observed feature vectors are computed
using some distance measure, and these local distances are accumulated along each possible
alignment path. The lowest scoring path then identifies the optimal alignment for a word,
and the word template obtaining the lowest overall score depicts the recognised word or
sequence of words.
ii. Knowledge-based approaches: expert knowledge about variations in speech is hand-coded
into a system. This has the advantage of explicitly modelling variations in speech; but
unfortunately such expert knowledge is difficult to obtain and use successfully. Thus this
approach was judged to be impractical, and automatic learning procedures were sought
instead.
iii. Statistical approaches, in which variations in speech are modelled statistically using
automatic statistical learning procedures, typically Hidden Markov Models, or HMMs. This
approach represents the current state of the art. The main disadvantage of statistical
models is that they must make a priori modelling assumptions, which are liable to be
inaccurate, handicapping the system performance. In recent years, a new approach to the
challenging problem of conversational speech recognition has emerged, holding a promise to
overcome some fundamental limitations of the conventional Hidden Markov Model (HMM) approach
(Bridle et al., 1998 [2]; Ma and Deng, 2004 [14]). This new approach is a radical departure
from the current HMM-based statistical modeling approaches. Rather than using a large number
of unstructured Gaussian mixture components to account for the tremendous variation in the
observable acoustic data of highly coarticulated spontaneous speech, the new speech model
that (Ma and Deng, 2004) [15] have developed provides a rich structure for the partially
observed (hidden) dynamics in the domain of vocal-tract resonances.
iv. Learning-based approaches: to overcome the disadvantages of HMMs, machine learning
methods could be introduced, such as neural networks and genetic algorithms/programming. In
those machine learning models, explicit rules (or other domain expert knowledge) do not need
to be given; they can be learned automatically through emulation or an evolutionary process.
v. The artificial intelligence approach attempts to mechanise the recognition procedure
according to the way a person applies intelligence in visualizing, analysing, and finally
making a decision on the measured acoustic features. Expert systems are used widely in this
approach (Mori et al., 1987) [18].
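The statistical approach can be made concrete with a toy example: a two-state discrete HMM
decoded with the Viterbi algorithm, the dynamic-programming search at the heart of HMM
recognisers such as those built with HTK. The states, observation symbols and probabilities
below are invented purely for illustration and are not taken from this project.

```python
import math

# Toy discrete HMM: two hidden states and two observation symbols.
# All probabilities below are invented purely for illustration.
states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3},
         "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"lo": 0.8, "hi": 0.2},
        "s2": {"lo": 0.3, "hi": 0.7}}

def viterbi(obs):
    """Return the most likely hidden state sequence for an observation list."""
    # delta[s] holds the log-probability of the best path ending in state s.
    delta = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    backptr = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best_p, best_prev = max(
                (prev[r] + math.log(trans[r][s]), r) for r in states)
            delta[s] = best_p + math.log(emit[s][o])
            ptr[s] = best_prev
        backptr.append(ptr)
    # Trace back from the best final state.
    path = [max(delta, key=delta.get)]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["lo", "lo", "hi", "hi"]))  # → ['s1', 's1', 's2', 's2']
```

In a word recogniser, each word has its own HMM, the observations are feature vectors rather
than symbols, and the word whose model gives the highest Viterbi score is reported.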
2.4 Matching Techniques
Speech-recognition engines match a detected word to a known word using one of the following
techniques (Svendsen et al., 1989) [29].
i. Whole-word matching: the engine compares the incoming digital-audio signal against a
prerecorded template of the word. This technique takes much less processing than sub-word
matching, but it requires that the user (or someone) prerecord every word that will be
recognized - sometimes several hundred thousand words. Whole-word templates also require
large amounts of storage (between 50 and 512 bytes per word) and are practical only if the
recognition vocabulary is known when the application is developed.
ii. Sub-word matching: the engine looks for sub-words - usually phonemes - and then performs
further pattern recognition on those. This technique takes more processing than whole-word
matching, but it requires much less storage (between 5 and 20 bytes per word). In addition,
the pronunciation of the word can be guessed from English text without requiring the user to
speak the word beforehand.
(Svendsen et al., 1989) [29], (Rabiner et al., 1981) [22], and (Wilpon et al., 1988) [34]
discuss that although research in the area of automatic speech recognition has been pursued
for the last three decades, only whole-word based speech recognition systems have found
practical use and become commercial successes. Though whole-word models have become a
success, the researchers mentioned above all agree that they still suffer from two major
problems, namely co-articulation and the need for a lot of training to build a good
recognizer.
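Whole-word template matching is classically realised with the dynamic time warping scheme
described in Section 2.3. The sketch below uses hypothetical one-dimensional "feature
tracks" (real recognisers align sequences of feature vectors such as MFCCs), and the word
labels are illustrative only: it accumulates local distances along alignment paths and picks
the template with the lowest overall score.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best alignment between feature sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])           # local distance measure
            cost[i][j] = local + min(cost[i - 1][j],       # stretch the template
                                     cost[i][j - 1],       # stretch the utterance
                                     cost[i - 1][j - 1])   # advance both
    return cost[n][m]

def recognise(utterance, templates):
    """Pick the template word with the lowest overall DTW score."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

# Hypothetical one-dimensional "feature tracks" for two words.
templates = {"zeru": [1, 3, 4, 3, 1], "rimwe": [5, 6, 6, 5]}
print(recognise([1, 3, 3, 4, 3, 1], templates))  # → zeru
```

The warping lets an utterance spoken faster or slower than the template still align with low
cost, which is exactly the speech-rate variability that fixed templates alone cannot absorb.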
2.5 Corpora
To build any speech engine, whether a speech recognition engine or a speech synthesis
engine, you need a corpus. Corpora are any collections of text and/or speech, and are used
as a basis for statistical processing of natural language (Jurafsky and Martin, 2000) [10].
There are various kinds of corpora: tagged or untagged; monolingual or multilingual;
balanced or specialized. For example, one of the largest and best-known corpora, the British
National Corpus (Warwick, 1997) [32], consists of 100 million words of written (about 90%)
and speech (about 10%) data collected from modern British English, covering a variety of
styles and subjects. A speech corpus could be specialised, with only telephone data (Cole et
al., 1992) [4], names, names of places, etc. Developing a speech corpus may involve data
collection and transcription (Cole et al., 1994) [3].
2.6 Problems in Designing Speech Recognition Systems
ASR has proved to be no easy task. According to (Rudnicky et al., 1993) [26], the main
challenge in the implementation of ASR on desktops is the current existence of mature and
efficient alternatives, the keyboard and mouse. In the past years, speech researchers have
found several difficulties that contrast with the optimism of the first speech technology
pioneers. In his review of speech recognition by machines, Ray Reddy (Reddy, 1976) [23] says
that the problems in designing ASR are due to the fact that it is related to so many other
fields, such as acoustics, signal processing, pattern recognition, phonetics, linguistics,
psychology, neuroscience, and computer science. All these problems can be described
according to the tasks to be performed.
i. Number of speakers: with more than one speaker, an ASR system must cope with the
difficult problem of speech variability from one speaker to another. This is usually
addressed through the use of a large speech database as training data (Huang et al., 2004)
[9].
ii. Nature of the utterance: isolated word recognition imposes on the speaker the need to
insert artificial pauses between successive utterances. Continuous speech recognition
systems are able to cope with natural speech utterances in which words may be tied together
and may at times be strongly affected by co-articulation. Spontaneous speech recognition
systems allow the possibility of pauses and false starts in the utterance, the use of words
not found in the lexicon, etc.
iii. Vocabulary size: in general, increasing the size of the vocabulary decreases the
recognition scores.
iv. Differences between speakers due to sex, age, accent and so on.
v. Language complexity: the task of continuous speech recognisers is simplified by limiting
the number of possible utterances through the imposition of syntactic and semantic
constraints.
vi. Environment conditions: the sites of real applications often present adverse conditions
(such as noise, distorted signals, and transmission line variability) which can drastically
degrade system performance.
2.7 Similar Projects Carried out
African Speech Technology is the working title of a 3-year project at the University of
Stellenbosch promoting the development of the official languages of South Africa through
language and speech technology applications. So far they have covered South African English,
isiZulu, isiXhosa, Sesotho and Afrikaans (Roux et al., 2000) [25]. While African Speech
Technology and other research centers are engaged in speech technology research, there is
still a long way to go in automatic speech recognition of many indigenous languages in
Africa. Most of what is done in automatic speech recognition worldwide revolves around the
many English dialects and major languages of the northern hemisphere.
13
Chapter 3
METHODOLOGY
This chapter gives a full description of how the Kinyarwanda language speech recognition
system was developed. The goal of the project was to build a robust whole word recognizer.
That means it should be able to generalise away from speaker specific properties, and its
training should be more than just instance-based learning. In the HMM paradigm this is
supposed to be the case, but the researcher intended to put this into practice.
As the time scope was limited, and to be able to focus on more specific issues than HMMs in
general, the Hidden Markov Model Toolkit (HTK) was used. HTK is a toolkit for building
Hidden Markov Models (HMMs). HMMs can be used to model any time series, and the core of HTK
is similarly general-purpose. However, HTK is primarily designed for building HMM-based
speech processing tools, in particular recognisers (Young S. et al., 2002) [35].
Secondly, to reduce the difficulty of the task, a very limited language model was used.
Future research can be directed to more extensive language models. In ASR systems, acoustic
information is sampled as a signal suitable for processing by computers and fed into a
recognition process. The output of the system is a hypothesis transcription of the
utterances.
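The sampling and conditioning done by such a front end can be pictured with a minimal
pre-emphasis and framing routine in pure Python. The 0.97 coefficient and the 25 ms / 10 ms
framing are conventional textbook choices, not necessarily the settings used in this
project; a real front end would go on to compute MFCC features from each windowed frame.

```python
import math

def preemphasise(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frames(signal, width=400, step=160):
    """Cut a signal into overlapping Hamming-windowed frames
    (400/160 samples = 25 ms / 10 ms at a 16 kHz sampling rate)."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (width - 1))
              for n in range(width)]
    return [[x * w for x, w in zip(signal[s:s + width], window)]
            for s in range(0, len(signal) - width + 1, step)]

# A one-second synthetic "recording": a 440 Hz tone sampled at 16 kHz.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
print(len(frames(preemphasise(signal))))  # → 98 frames of 25 ms each
```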
14
Figure 3.1:Components of an ASR system
Speech recognition is a complicated task, and state-of-the-art recognition systems are very complex. For pragmatic reasons the project was restricted to the same domain as the HTK tutorial suggests, namely instructions that a telephone can perform, such as "Dial one two zero".
System construction approach. There are many different approaches to the implementation of an ASR system, but for this project the four major processing steps suggested by HTK (Young S. et al., 2002) [35] were followed, namely data preparation, training, recognition/testing, and analysis. For implementation purposes the following sub-processes were undertaken:
i. Building the task grammar
ii. Constructing a dictionary for the models
iii. Recording the data
iv. Creating transcription files for the training data
v. Encoding the data (feature processing)
vi. (Re-)training the acoustic models
vii. Evaluating the recognisers against the test data
viii. Reporting recognition results
3.1 Data Preparation
The first stage of any recogniser development project is data preparation. Speech data is needed both for training and for testing. In the system built here, all of this speech was recorded from scratch. The training data is used during the development of the system. Test data provides the reference transcriptions against which the recogniser's performance can be measured, and a convenient way to create them is to use the task grammar as a random generator. In the case of the training data, the prompt scripts are used in conjunction with a pronunciation dictionary to provide the initial phone-level transcriptions needed to start the HMM training process.
It follows from the above that before the data can be recorded, a phone set must be defined, a dictionary must be constructed to cover both training and testing, and a task grammar must be defined.
3.1.1 The Task Grammar
The task grammar defines constraints on what the recognizer can expect as input. As the system built provides a voice-operated interface for phone dialling, it handles digit strings. For the limited scope of this project, only the digits 0 to 9 were needed, forming a toy grammar. The grammar was defined in BNF (Backus-Naur Form), as follows: $variable defines a phrase as anything between the subsequent = sign and the semicolon, where | stands for a logical OR. Parentheses have the usual grouping function and square brackets denote optionality. The toy grammar used was:
#
#Task grammar
#
$digit=RIMWE|KABIRI|GATATU|KANE|GATANU|GATANDATU|KARINDWI|UMUNANI|ICYENDA|ZERO;
(SENT-START[$digit] SENT-END)
The above grammar can be depicted as a network, as shown below.
Figure 3.2: Grammar for voice dialling
Word network
The above high-level representation of a task grammar is provided for user convenience. The HTK recogniser actually requires a word network defined in a low-level notation called HTK Standard Lattice Format (SLF), in which each word instance and each word-to-word transition is listed explicitly. This word network can be created automatically from the grammar above using the HParse tool. Thus, assuming that the file gram contains the above grammar, executing
HParse gram wdnet
creates an equivalent word network in the file wdnet (Appendix A); see the figure below.
Figure 3.3: Process of creating a word lattice
The lattice created above can now be used by another HTK tool, HSGen, to generate random sentences. These are the sentences used later for training and testing purposes.
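What HSGen does with this grammar can be mimicked in a few lines of Python. This is only a rough analogue: HSGen itself traverses the compiled SLF word network, and the 50/50 choice for the optional digit is an assumption.

```python
import random

# Hypothetical re-implementation of random sentence generation from the
# toy grammar (SENT-START [$digit] SENT-END); HSGen reads the SLF network.
DIGITS = ["RIMWE", "KABIRI", "GATATU", "KANE", "GATANU",
          "GATANDATU", "KARINDWI", "UMUNANI", "ICYENDA", "ZERO"]

def generate_sentence(rng=random):
    """Expand the grammar once: the digit between the markers is optional."""
    words = ["SENT-START"]
    if rng.random() < 0.5:          # [$digit] -> present or absent
        words.append(rng.choice(DIGITS))
    words.append("SENT-END")
    return " ".join(words)

sentences = [generate_sentence() for _ in range(10)]
```

Every generated sentence is, by construction, accepted by the task grammar, which is exactly the property that makes such sentences usable as recording prompts.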
3.1.2 A Pronunciation Dictionary
The dictionary provides an association between the words used in the task grammar and the acoustic models, which may be composed of sub-word (phonetic, syllabic, etc.) units. Since this project provides a voice-operated interface, the dictionary could have been constructed by hand, but the researcher wanted to try a different method which could also be used to construct a dictionary for a large-vocabulary ASR system. In order to train the HMM network, a large pronunciation dictionary is needed.
Since whole-word models are used in this project, the dictionary has a simple structure. A file called 'lexicon' was created with the following structure:
GATANDATU gatandatu
GATANU gatanu
GATATU gatatu
ICYENDA icyenda
KABIRI kabiri
KANE kane
KARINDWI karindwi
RIMWE rimwe
SENT-END [] sil
SENT-START [] sil
UMUNANI umunani
ZERO zero
A file named wdlist.txt was created containing all the words that make up the vocabulary.
GATANDATU
GATANU
GATATU
ICYENDA
KABIRI
KANE
KARINDWI
RIMWE
SENT-END
SENT-START
UMUNANI
ZERO
The dictionary was finally created by using HDMan as follows:
HDMan -m -w wdlist.txt -n models1 -l dlog dict lexicon
This creates a new dictionary called dict by searching the source dictionary, lexicon, to find pronunciations for each word in wdlist.txt. Here, the wdlist.txt in question needs only to be a sorted list of the words appearing in the task grammar given above. The option -l instructs HDMan to output a log file, dlog, which contains various statistics about the constructed dictionary. In particular, it indicates whether any words are missing. HDMan can also output a list of the words used, here called models1. Once training and test data have been recorded, an HMM will be estimated for each of these words.
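The lookup that HDMan performs here can be sketched as follows. This is a simplified analogue only: real HDMan also handles multiple source dictionaries, output symbols such as the [] entries above, and edit scripts.

```python
# Sketch of what HDMan -w does in this project: for each word in the sorted
# word list, look up its pronunciation in the source lexicon; missing words
# would be reported (HDMan writes them to the dlog file).
lexicon = {
    "GATANDATU": "gatandatu", "GATANU": "gatanu", "GATATU": "gatatu",
    "ICYENDA": "icyenda", "KABIRI": "kabiri", "KANE": "kane",
    "KARINDWI": "karindwi", "RIMWE": "rimwe", "SENT-END": "sil",
    "SENT-START": "sil", "UMUNANI": "umunani", "ZERO": "zero",
}
wdlist = sorted(lexicon)            # wdlist.txt must be a sorted word list

def build_dict(wdlist, lexicon):
    entries, missing = [], []
    for word in wdlist:
        if word in lexicon:
            entries.append((word, lexicon[word]))
        else:
            missing.append(word)    # would appear in the dlog statistics
    return entries, missing

dict_entries, missing = build_dict(wdlist, lexicon)
```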
3.1.3 Recording
In order to train and test the recognizer on the domain and on the voices of selected people, 10 sentences were automatically generated from the grammar with HTK's HSGen. See appendix B for the training and testing sentences. Speech data from six (6) different speakers, 3 males and 3 females of different age groups, was recorded. Due to the researcher's lack of access to a recording studio, the recordings were done in an office on Sundays, when there are no people in the office. As the toolkit does not require phoneme duration information for the training sentences, the (differences in) timing in the pronunciation of the training sentences is not important. The toolkit learns to recognise the words by fitting the word transcriptions on the training set. These transcriptions are used for all realisations of the same sentence, even though there might be variation between speakers relative to the transcription.
The speakers were given a list of sentences which they had to read aloud. After about 5 sentences they took a short break and drank a glass of water. The training corpus, consisting of 150 sentences, was recorded and labelled using the HTK tool HSLab.
Figure 3.4: Recording and labelling data using HSLab
After recording and labelling the training sentences, a test corpus was also created in the same way as the training corpus, in this case using 70 recorded sentences. The differences noted in pronunciation between speakers (and their consequences) can be categorised as articulation variation: e.g., some speakers had a rolling 'r' and others did not, for example in 'kabiri' and 'rimwe'. Phonetic change degrades the quality of the training set, since the same phonetic transcription was used for all speakers. These phonetic change problems were addressed by using isolated whole-word models and many different sentences, such that the resulting system was speaker independent.
Articulation variation, on the other hand, is of course a problem for recognition, but if there were no articulation variation the task of recognising would reduce to an instance-based learning problem.
3.1.4 Phonetic Transcription
For training, we need to tell the recognizer which files correspond to which digit. HTK uses so-called Master Label Files (MLFs) to store information associated with speech. What makes things a bit confusing is the fact that there are two things an MLF can contain: words and phonemes. The tutorial shows the usage of various HTK tools that can convert lists of sentences into lists of words and then lists of phonemes, the last two in an MLF. Since the objective of this project was to create an isolated word recognizer, a file called source.mlf was created associating each recorded and labelled speech file with a word:
#!MLF!#
"data/train/rimwe01.lab"
RIMWE
.
"data/train/rimwe02.lab"
RIMWE
.
etc.
See appendix C for details.
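Producing such an MLF programmatically is straightforward; the sketch below (file names assumed, following the pattern above) builds the same three-part structure of header, label-file name, word, and "." terminator:

```python
# Minimal sketch of building source.mlf: one quoted label-file header per
# utterance, the word it contains, and a "." terminator line.
utterances = [("data/train/rimwe01.lab", "RIMWE"),
              ("data/train/rimwe02.lab", "RIMWE"),
              ("data/train/kabiri01.lab", "KABIRI")]

def write_mlf(utterances):
    lines = ["#!MLF!#"]
    for lab_file, word in utterances:
        lines.append('"%s"' % lab_file)
        lines.append(word)
        lines.append(".")
    return "\n".join(lines) + "\n"

mlf_text = write_mlf(utterances)
```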
It is assumed that rimwe01.WAV contains the utterance 'rimwe', and so on. Next, the model transcriptions must be obtained. For this, an HTK edit script called 'mkphones0.led' was created containing the following:
EX
IS sil sil
DE sp
The HTK tool HLEd was used to convert the word transcriptions into model transcriptions (models0.mlf):
HLEd -d dict -i models0.mlf mkphones0.led source.mlf
3.1.5 Encoding the Data
The speech recognition tools cannot operate directly on speech waveforms. These have to be represented in a more compact and efficient way. This step is called "acoustical analysis": the signal is segmented into successive frames (whose length is typically chosen between 20 ms and 40 ms), overlapping with each other. Each frame is multiplied by a windowing function (e.g. the Hamming function).
A vector of acoustical coefficients (giving a compact representation of the spectral properties of the frame) is extracted from each windowed frame.
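The framing and windowing described above can be sketched as follows. The 16 kHz rate, 25 ms window and 10 ms shift correspond to the configuration values used in this project, but the code itself is only an illustrative assumption, not HTK's implementation:

```python
import math

# Illustrative framing/windowing, assuming 16 kHz audio, 25 ms frames
# and a 10 ms frame shift.
SAMPLE_RATE = 16000
FRAME_LEN = int(0.025 * SAMPLE_RATE)    # 400 samples per frame
FRAME_SHIFT = int(0.010 * SAMPLE_RATE)  # 160 samples between frame starts

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frames(signal):
    """Cut the signal into overlapping frames and apply the Hamming window."""
    window = hamming(FRAME_LEN)
    out = []
    for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = signal[start:start + FRAME_LEN]
        out.append([s * w for s, w in zip(frame, window)])
    return out

one_second = [1.0] * SAMPLE_RATE
windowed = frames(one_second)
```

Each windowed frame would then be passed to the MFCC computation; the overlap ensures that no part of the signal is seen only at a window edge, where the Hamming taper suppresses it.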
In order to specify to HTK the nature of the audio data (format, sample rate, etc.) and the feature extraction parameters (type of feature, window length, pre-emphasis, etc.), a configuration file (config.txt) was created as follows:
#Coding parameters
SOURCEKIND = waveform
SOURCEFORMAT = HTK
SOURCERATE = 625
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
To run HCopy, a list of each source file and its corresponding output file was created. The first few lines look like:
data/train/rimwe01.sig data/MFC/rimwe01.MFC
data/train/rimwe02.sig data/MFC/rimwe02.MFC
data/train/rimwe03.sig data/MFC/rimwe03.MFC
.
.
data/train/sil10.sig data/MFC/sil10.MFC
See appendix D for details; there is one line for each file in the training set. This file tells HTK to extract features from each audio file in the first column and save them to the corresponding feature file in the second column. The command used is:
HCopy -T 1 -C config.txt -S hcopy.scp
3.2 Parameter Estimation (Training)
Defining the structure and overall form of a set of HMMs is the first step towards building a recognizer. The second step is to estimate the parameters of the HMMs from examples of the data sequences that they are intended to model. This process of parameter estimation is usually called training. The topology for each of the HMMs to be trained is defined by writing a prototype definition. HTK allows HMMs to be built with any desired topology. HMM definitions can be stored externally as simple text files, and hence it is possible to edit them with any convenient text editor. With the exception of the transition probabilities, all of the HMM parameters given in the prototype definition are ignored. The purpose of the prototype definition is only to specify the overall characteristics and topology of the HMM. The actual parameters will be computed later by the training tools. Sensible values for the transition probabilities must be given, but the training process is very insensitive to these. An acceptable and simple strategy for choosing these probabilities is to make all of the transitions out of any state equally likely. In principle the HMM should be trained on a large corpus containing a wide range of word pronunciations. For this purpose 150 sentences were recorded and labelled, as stated above; see the training corpus CD for the training data.
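The "equally likely" strategy can be sketched as follows. The allowed-transition table mirrors the 6-state topology of the prototype definition given later in this chapter; packaging it as a Python dictionary is purely an illustrative assumption.

```python
# Sketch of the "equally likely" strategy: given the transitions permitted
# by a 6-state left-to-right topology (with skips), give each outgoing arc
# of a state the same probability.
N_STATES = 6
allowed = {                     # successor states permitted by the topology
    1: [2, 3],
    2: [2, 3, 4],
    3: [3, 4, 5],
    4: [4, 5, 6],
    5: [5, 6],
    6: [],                      # final non-emitting state: no outgoing arcs
}

def equal_transp(allowed, n):
    """Build an n-by-n transition matrix with uniform outgoing probabilities."""
    transp = [[0.0] * n for _ in range(n)]
    for state, succs in allowed.items():
        for s in succs:
            transp[state - 1][s - 1] = 1.0 / len(succs)
    return transp

transp = equal_transp(allowed, N_STATES)
```

Each non-final row sums to 1, as any valid stochastic transition matrix must; the training tools then replace these starting values with estimates from the data.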
3.2.1 Training Strategies
HTK offers two different approaches to training on speech data:
Figure 3.5: Training HMMs
Firstly, an initial set of models must be created. If there is some speech data available for which the location of the word boundaries has been marked, then this can be used as bootstrap data. In this case, the tools HInit and HRest provide isolated-word-style training using the fully labelled bootstrap data. Each of the required HMMs is generated individually. HInit reads in all of the bootstrap training data and cuts out all of the examples of the required phone. It then iteratively computes an initial set of parameter values using a segmental k-means procedure.
On the first cycle, the training data is uniformly segmented, each model state is matched with the corresponding data segments, and then means and variances are estimated. If mixture Gaussian models are being trained, then a modified form of k-means clustering is used. On the second and successive cycles, the uniform segmentation is replaced by Viterbi alignment. The initial parameter values computed by HInit are then further re-estimated by HRest.
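The uniform-segmentation step of this first cycle can be sketched as follows. One-dimensional features and single-Gaussian states are simplifying assumptions made here for brevity; HInit itself works on 39-dimensional MFCC vectors.

```python
# Sketch of HInit's first cycle: frames of an utterance are divided
# uniformly among the emitting states, then a mean and variance are
# estimated per state from its segment.
def uniform_segment(frames, n_states):
    """Split the frame list into n_states roughly equal contiguous chunks."""
    k, m = divmod(len(frames), n_states)
    segments, start = [], 0
    for i in range(n_states):
        end = start + k + (1 if i < m else 0)
        segments.append(frames[start:end])
        start = end
    return segments

def estimate(segment):
    """Sample mean and (biased) variance of one state's segment."""
    mean = sum(segment) / len(segment)
    var = sum((x - mean) ** 2 for x in segment) / len(segment)
    return mean, var

frames = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2, 2.0, 2.1, 2.2, 2.3]
params = [estimate(seg) for seg in uniform_segment(frames, 4)]
```

On later cycles the segment boundaries would come from Viterbi alignment instead of this uniform split, which is what lets the state/frame assignment adapt to the data.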
Since this project was concerned with isolated whole words, the strategy described above was used. If there is no marked data, the tool HCompV is used instead. In this project, since all the data was labelled, HInit and HRest were used for training purposes.
Figure 3.6: Training isolated whole word models
3.2.2 HMM Definition
The first step in HMM training is to define a prototype model. The purpose of the prototype is to define a model topology on which all the other models can be based. In HTK an HMM is stored as a text description file, in this case:
~o
<VecSize> 39
<MFCC_0_D_A>
~h "proto"
<BeginHMM>
<NumStates> 6
<State> 2
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 4
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 5
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<TransP> 6
0.0 0.5 0.5 0.0 0.0 0.0
0.0 0.4 0.3 0.3 0.0 0.0
0.0 0.0 0.4 0.3 0.3 0.0
0.0 0.0 0.0 0.4 0.3 0.3
0.0 0.0 0.0 0.0 0.5 0.5
0.0 0.0 0.0 0.0 0.0 0.0
<EndHMM>
Models for each of the words were also constructed; see appendix E for the details.
3.2.3 HMM Training
The training described in the parameter estimation introduction can be summarized in diagram form as below.
Figure 3.7: HMM training process
Initialisation
The HTK tool HInit was used to initialise the models as given below:
HInit -A -D -T 1 -S train.scp -M model/hmm0 -H hmmfile -l label -L label_dir nameofhmm
where:
nameofhmm is the name of the HMM to initialise (here: rimwe, kabiri, etc., or sil). hmmfile is a description file containing the prototype of the HMM called nameofhmm (here: hmm_rimwe, hmm_kabiri, etc.).
train.scp gives the complete list of the .mfc files forming the training corpus (stored in directory data/train/mfc).
label_dir is the directory where the label files (.lab) corresponding to the training corpus are stored (here: data/train/lab/).
label indicates which labelled segment must be used within the training corpus (here: rimwe, kabiri, etc.).
model/hmm0 is the name of the directory (which must be created beforehand) where the resulting initialised HMM description will be output.
This procedure has to be repeated for each model (hmm_rimwe, hmm_kabiri, hmm_gatatu, etc.). The HMM file output by HInit has the same name as the input prototype. E.g.
HInit -A -D -T 1 -S train.scp -M model/hmm0 -H hmm_1.txt -l rimwe -L data/train rimwe
This process was repeated for all the models. The HTK tool HCompV was then run as follows:
HCompV -C config.txt -f 0.01 -m -S train.scp -M hmm0 proto.txt
HCompV was not used to initialise the models (that was already done with HInit). It is used here only because it outputs, along with the initialised model, an interesting file called vFloors, which contains the global variance vector multiplied by a factor of 0.01 (see Appendix F). The values stored in vFloors (the "variance floor macro", varFloor1) are used later during the training process as floor values for the estimated variance vectors. This results in the creation of two files, proto and vFloors, in the directory hmm0. These files were edited in the following way: an error occurs at this point which rearranges the parts of the MFCC_0_D_A label as MFCC_D_A_0; this was corrected. The first three lines of proto were then cut and pasted into vFloors, and the result was saved as macros.
3.2.4 Training
The following command line was used to perform one re-estimation iteration with the HTK tool HRest, estimating the optimal values for the HMM parameters (transition probabilities, plus the mean and variance vectors of each observation function):
HRest -A -D -T 1 -S train.scp -M model/hmm1 -H vFloors -H model/hmm0/hmm_1.txt -l rimwe -L data/train rimwe
train.scp gives the complete list of the .mfc files forming the training corpus (stored in directory data/train/mfc). model/hmm1, the output directory, indicates the index of the current iteration. vFloors is the file containing the variance floor macro obtained with HCompV. hmm_1.txt is the description file of the HMM called rimwe; it is stored in a directory whose name indicates the index of the last iteration (here model/hmm0). -l rimwe is an option that indicates the label to use within the training data (rimwe, kabiri, etc.). data/train is the directory where the label files (.lab) corresponding to the training corpus are stored. rimwe is the name of the HMM to train. This procedure has to be repeated several times for each of the HMMs (kabiri, gatatu, kane, sil) to be trained. Each time, the HRest iterations (i.e. iterations within the current re-estimation iteration) are displayed on screen, indicating convergence through the change measure. As soon as this measure does not decrease (in absolute value) from one HRest iteration to another, it is time to stop the process. In this project 3 re-estimation iterations were used. The final word HMMs are then hmm3/hmm_1, hmm3/hmm_0, hmm3/hmm_sil, etc. A file called hmmdefs.txt was created by combining all the HMM definitions into one file (see appendix E). After each iteration an error occurred which rearranges the parts of the MFCC_0_D_A label as MFCC_D_A_0; this was corrected after each iteration.
3.3 Recognition
The recognizer is now complete and its performance can be evaluated. The recognition network and dictionary have already been constructed, and test data has been recorded. Thus, all that remains is to run the recognizer. The recognition process can be summarized as in the figure below.
Figure 3.8: Speech recognition process
An input speech signal is first transformed into a series of "acoustical vectors" (here MFCCs) using the HTK tool HCopy, in the same way as was done with the training data. The resulting file list was stored in test.scp (the feature files are often called the acoustical observations).
The input observation was then processed by a Viterbi algorithm, which matches it against the recogniser's Markov models, using the HTK tool HVite as follows:
HVite -A -D -T 1 -H model/hmm3/hmmdefs.txt -i recout.mlf -w wdnet dict hmmlist.txt -S test.scp
Where:
hmmdefs.txt contains the definition of the HMMs. It is possible to repeat the -H option and list the different HMM definition files, in this case: -H model/hmm3/hmm_0.txt -H model/hmm3/hmm_1.txt, etc., but it is more convenient (especially when there are more than 3 models) to gather all the definitions in a single file, called a Master Macro File. For this project this file was obtained by copying each definition one after the other into a single file, without repeating the header information (see Appendix E).
The output is stored in a file (recout.mlf) which contains the transcription of the input (see appendix G).
recout.mlf is the output recognition transcription file.
wdnet is the task network.
dict is the task dictionary.
hmmlist.txt lists the names of the models to use (rimwe, kabiri, etc.). Each element is separated by a new-line character.
test.scp is the input data to be recognised.
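The Viterbi matching that HVite performs can be illustrated with a toy discrete-observation HMM. The states, probabilities and observation symbols below are invented for illustration only; HTK works with continuous Gaussian output densities over MFCC vectors, but the dynamic-programming recursion is the same.

```python
import math

# Toy Viterbi decoder: find the most likely state sequence for a sequence
# of discrete observation symbols, working in log probabilities.
states = ["s1", "s2"]
log_trans = {("s1", "s1"): math.log(0.7), ("s1", "s2"): math.log(0.3),
             ("s2", "s2"): math.log(0.9), ("s2", "s1"): math.log(0.1)}
log_emit = {("s1", "a"): math.log(0.8), ("s1", "b"): math.log(0.2),
            ("s2", "a"): math.log(0.1), ("s2", "b"): math.log(0.9)}
log_init = {"s1": math.log(0.6), "s2": math.log(0.4)}

def viterbi(obs):
    """Return the most likely state sequence for the observation list."""
    v = [{s: log_init[s] + log_emit[(s, obs[0])] for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] + log_trans[(p, s)])
            col[s] = v[-1][prev] + log_trans[(prev, s)] + log_emit[(s, o)]
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    best = max(states, key=lambda s: v[-1][s])
    path = [best]
    for ptr in reversed(back):        # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return list(reversed(path))

path = viterbi(["a", "a", "b", "b"])
```

In HVite the same recursion additionally runs over the word network, so the back-trace yields a word sequence rather than a single model's state sequence.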
3.4 Running the Recognizer Live
The built recogniser was tested with live input. To do this, the configuration parameters were altered as given below:
#Waveform capture
SOURCERATE=625.0
SOURCEKIND=HAUDIO
SOURCEFORMAT=HTK
ENORMALISE=F
USESILDET=T
MEASURESIL=F
OUTSILWARN=T
These indicate that the source is direct audio with a sample period of 62.5 microseconds (i.e. a 16 kHz sampling rate). The silence detector is enabled, and a measurement of the background speech/silence levels is made at start-up. The final line makes sure that a warning is printed when this silence measurement is being made. Once the configuration file had been set up for direct audio input, the HTK tool HVite was again used to recognise live input using a microphone.
Chapter 4
RESULTS
The recognition performance of an ASR system must be measured on a corpus of data different from the training corpus. A separate test corpus, with new Kinyarwanda language digit recordings, was created as was previously done with the training corpus. The test corpus was made of 50 recorded and labelled utterances, which were later converted into MFCC feature files. In order to test the speaker independence of the system, some of the subjects who participated in the creation of the testing corpus had not participated in the creation of the training corpus.
4.1 Performance Test
Evaluation of the performance of the speech recognition system was done using the HTK tool HResults.
On running the tool against the testing data, the following performance statistics were obtained:
4.2 Performance Analysis
The first line (SENT) gives the sentence recognition rate (%Correct=92.00); the second (WORD) gives the word recognition rate (%Corr=94.87). The first line (SENT) should
Figure 4.1: Speech recognition results
be considered here. H=46 gives the number of test sentences correctly recognized, S=4 the number of substitution errors, and N=50 the total number of test sentences. These results imply that of the 50 sentences making up the testing corpus, only 46 were correctly recognized, which is equivalent to 92.00%, and four (4) sentences were substituted by other sentences. The statistics given on the second line (WORD) only make full sense with more sophisticated types of recognition systems (e.g. connected-word recognition tasks). Nevertheless, there were 6 deletion errors (D), 2 substitution errors (S) and 0 insertion errors (I). N=156 gives the total number of words making up the test data, and of these 148 were correctly recognized, giving 94.87% recognition. The accuracy figure (Acc) of 94.87% is the same as the percentage correct (Corr) because the former additionally takes account of insertion errors, and in this case the insertion errors are zero. These results indicate that the training of the system was successful and that the developed system is speaker independent.
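The HResults figures above can be reproduced with a short calculation, using the standard definitions %Correct = H/N and %Accuracy = (H - I)/N:

```python
# Recomputing the HResults figures from the counts reported above
# (H = hits, D = deletions, S = substitutions, I = insertions, N = total).
def correct_pct(h, n):
    return 100.0 * h / n

def accuracy_pct(h, i, n):
    return 100.0 * (h - i) / n      # accuracy additionally penalises insertions

# Sentence level: H=46, N=50
sent_corr = correct_pct(46, 50)
# Word level: N=156, D=6, S=2, I=0, so H = 156 - 6 - 2 = 148
word_corr = correct_pct(148, 156)
word_acc = accuracy_pct(148, 0, 156)
```

With I = 0 the two word-level figures coincide, exactly as observed in the HResults output.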
4.3 Testing the System on Live Data
To further test the system on live data, and also to again test its speaker independence, the system was run live. Four (4) different speakers who had not participated in the creation of the training corpus helped test the system live. The subjects read the Kinyarwanda language numeric digits aloud, and the table below gives a summary of the results. These results show that the system is speaker independent, with a few errors which can be reduced by training the system on a larger training corpus and also by including recordings from speakers from different parts of the Great Lakes region who speak Kinyarwanda.
Figure 4.2: Live data recognition results
Chapter 5
DISCUSSION, CONCLUSION AND
RECOMMENDATIONS
In this project, the main task was to develop an automatic speech recognizer for the Kinyarwanda language. This system is aimed at improving on the current human-computer interface by introducing a voice interface, which has proved to have many advantages over traditional I/O methods. Users naturally know how to speak, so this provides an easy interface which does not require the special training that is normally needed when using various ICT tools for the first time. The scope was limited to the numeric digits only, which could be used in many systems, most especially an automatic telephone dialing system.
This five-chapter report contains the introduction to the study in chapter one and a literature review on human-computer interfaces, ASR and ongoing African ASR projects in chapter two. Chapter three presented the methodology that was used to achieve the objectives, while chapter four concentrated on the performance and testing of the recognizer developed. This is the last chapter of the report, in which the discussion, conclusion and recommendations are given.
5.1 Discussion
It has been discovered that there are many people who have a computer phobia. One reason why many people fear to use ICT tools has been the inadequate user interfaces, which make it difficult for new users to explore or take a step into using these unavoidable ICT tools. A lot has been done by many researchers on improving user interfaces, and one of the improvements has been the inclusion of voice interfaces. It was noted by the researcher that most of the systems developed were mainly considering the five major international languages.
The researcher therefore found it necessary to build an ASR system which could be a starting point for many educational and commercial projects on building speech recognisers for the Kinyarwanda language. In order to develop the system, the researcher first read and analysed research papers on the trends in speech recognition. He then read reviews on the current state-of-the-art speech recognisers.
Before attempting to build a speech recogniser for a new language, it is always advisable to start with a language that has already been tested; in this case the researcher first constructed an English Yes/No recogniser, which paved the way for the new language speech recognisers.
The Cambridge University Hidden Markov Model Toolkit (HTK) was used for the implementation of the recogniser. HTK was chosen because it is free and has been used by many researchers all over the world. HTK supports both isolated whole-word recognition and sub-word or phone-based recognition.
Although research in the area of automatic speech recognition has been pursued for the last three decades, only whole-word based speech recognition systems have found practical use and become commercial successes (Rabiner et al., 1981 [22]; Wilpon et al., 1988 [34]). Two important reasons for this success are that the effects of context-dependence and co-articulation within the word are implicitly built into the word models, and that there is no need for lexical decoding.
Isolated word recognition was chosen for this project because it proved to be much easier: the pauses between the words make it easy to detect the start and end, making it possible to recognise one word at a time.
A limited grammar and dictionary were constructed for use by the recognizer. Speech data was recorded and labelled from 6 different speakers, making up the training and testing corpora.
Since the researcher had labelled training data, the HTK tools HInit and HRest were used during the initialisation and training processes. The results obtained from the system showed that it can automatically recognise 94.87 percent of the words of a Kinyarwanda language speaker. The system was also tested on live data and performed well. Four different speakers participated in testing the system on live data, and performance was very good, as seen in figure 4.2. There were some cases where the word kane was substituted with the word karindwi. This problem was mainly observed with some specific speakers, not all.
5.2 Conclusion
The objective of this study was mainly to build a speech recognizer for the Kinyarwanda language. In order to meet this objective, a limited word grammar was constructed, a dictionary was created, and data from different Kinyarwanda language speakers was recorded and used for training.
The system was tested using the testing corpus and live data, and it scored 92.00% sentence recognition and 94.87% word recognition. This implies that the objective of creating a system that can recognize spoken Kinyarwanda was achieved.
The Kinyarwanda language automatic speech recognition recipe accompanying this report can be used by any researcher desiring to join language processing research.
The project is, however, not exhaustive, as it has catered only for a voice-operated phone dialing system. As much as it has created a basis for research, this project can be expanded to cater for more extensive language models and larger vocabularies.
5.3 Areas for Further Study
In spite of the successes of whole-word model speech recognizers, which are also exemplified in the success of this project, they suffer from two problems:
• Co-articulation effects across word boundaries. This problem has been reasonably well solved, and connected-word recognition systems with good performance have been reported in the literature (Rabiner et al., 1981 [22]; Wilpon et al., 1988 [34]).
• Amount of training data. It is extremely difficult to obtain good whole-word reference models from the limited amount of speech data available for training. This training problem becomes even worse for large-vocabulary speech recognition systems.
It is because of the above reasons that I recommend future research be undertaken on large-vocabulary Kinyarwanda language speech recognition using sub-words (phonemes), which solve the above-mentioned problems. A sub-word based approach is a viable alternative to the whole-word based approach because the word models are built from a small inventory of sub-word units.
Phoneme HMMs are generalisable (trainable) both towards larger vocabularies and towards different speakers.
REFERENCES
1.Baum,L.E.,and Petrie,T.,(1966).Statistical Inference for Probabilistic functions of
Finite-State Markov Chains,Annotated Mathematical Statistics,37:1554-1563.
2.Bridle,J.,Deng,L.,Picone,J.,Richards,H.,Ma,J.,Kamm,T.,Schuster,M.,Pike,
S.,Reagan,R.,1998.An investigation of segmental hidden dynamic models of speech
coarticulation for automatic speech recognition.Final Report for the 1998 Workshop
on Language Engineering,Center for Language and Speech Processing at Johns Hop-
kins University,pp.161.
3.Cole,R.Noel,M.Burnet,D.C.,Fanty,M.,Lander,T.,Oshika,B.,Sutton,S.,1994
Corpus development activities at the center for spoken language understanding.Hu-
man Language Technology Conference archive,Proceedings of the workshop on Human
Language Technology.Pages:31 - 36.
4.R.Cole,K.Roginski,and M.Fanty.,1992 A telephone speech database of spelled and
spoken names.In ICSLP’92,volume 2,pages 891–895.
5.Deshmukh,N.,Ganapathiraju,A,Picone J.,(1999),Hierarchical Search for Large
Vocabulary Conversational Speech Recognition.IEEE Signal Processing Magazine,
1(5):84-107.
6.Dix,A.J.,Finlay,J.,Abowd,G.,Beale,R.(1998).Human-Computer Interaction,2nd
edition,Prentice Hall,Englewood Cliffs,NJ,USA.
7.Dupont,S.,(2000),Audio-Visual Speech Modeling for Continuous Speech Recognition,
IEEE Transactions on multimedia,2(3):141-151
39
8. EarthTrends (2003). Population, Health, and Human Well-Being: Rwanda. Retrieved
20-01-2005 from
http://earthtrends.wri.org/pdf_library/country_profiles/Pop_cou_646.pdf.
9. Huang, C., Tao, C., and Chang, E. (2004). Accent Issues in Large Vocabulary Continuous
Speech Recognition. International Journal of Speech Technology, 7:141-153.
10. Jurafsky, D., Martin, J. (2000). Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics and Speech Recognition.
Delhi, India: Pearson Education.
11. Kagaba, S., Nsanzabaganwa, S., Mpyisi, E. (2003). Rwanda Country Position Paper.
Regional Workshop on Ageing and Poverty, Dar es Salaam, Tanzania. Retrieved 20-02-2005
from
http://www.un.org/esa/socdev/ageing/workshops/tz/rwanda.pdf.
12. Kandasamy, S. (1995). Speech Recognition Systems. SURPRISE Journal, 1(1).
13. Liu, F. H., Liang, G., Yuqing, G., and Picheny, M. (2004). Applications of Language
Modeling in Speech-to-Speech Translation. International Journal of Speech Technology,
7:221-229.
14. Ma, J., Deng, L. (2004). Target-Directed Mixture Linear Dynamic Models for Spontaneous
Speech Recognition. IEEE Transactions on Speech and Audio Processing, 12(1),
January 2004.
15. Ma, J., Deng, L. (2004). A Mixed-Level Switching Dynamic System for Continuous Speech
Recognition. Computer Speech and Language, 18:49-65.
16. Mane, A., Boyce, S., Karis, D., Yankelovich, N. (1996). Designing the User Interface for
Speech Recognition Applications. SIGCHI Bulletin, 28(4):29-34.
17. Mengjie, Z. (2001). Overview of Speech Recognition and Related Machine Learning
Techniques. Technical report. Retrieved December 10, 2004 from
http://www.mcs.vuw.ac.nz/comp/Publications/archive/CS-TR-01/CS-TR-01-15.pdf.
18. Mori, R. D., Lam, L., and Gilloux, M. (1987). Learning and Plan Refinement in a
Knowledge-Based System for Automatic Speech Recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 9(2):289-305.
19. Picheny, M. (2002). Large Vocabulary Speech Recognition. IEEE Computer, 35(4):42-50.
20. Pinker, S. (1994). The Language Instinct. Harper Collins, New York City, New York,
USA.
21. Rabiner, L. R., and Levinson, S. E. (1981). Isolated and Connected Word Recognition:
Theory and Selected Applications. IEEE Transactions on Communications, COM-29,
pp. 621-629.
22. Rabiner, L. R., and Wilpon, J. G. (1979). Considerations in Applying Clustering
Techniques to Speaker-Independent Word Recognition. Journal of the Acoustical Society of
America, 66(3):663-673.
23. Reddy, D. R. (1976). Speech Recognition by Machine: A Review. Proceedings of the
IEEE, 64(4):501-531.
24. Robertson, J., Wong, Y. T., Chung, C., and Kim, D. K. (1998). Automatic Speech
Recognition for Generalised Time Based Media Retrieval and Indexing. Proceedings of
the Sixth ACM International Conference on Multimedia (pp. 241-246), Bristol, England.
25. Roux, J. C., Botha, E. C., and Du Preez, J. A. (2000). Developing a Multilingual
Telephone Based Information System in African Languages. Proceedings of the Second
International Language Resources and Evaluation Conference. Athens, Greece: ELRA,
(2):975-980.
26. Rudnicky, A. I., Lee, K. F., and Hauptmann, A. G. (1992). Survey of Current Speech
Technology. Communications of the ACM, 37(3):52-57.
27. ScanSoft (2004). Embedded Speech Solutions. Retrieved January 25, 2005 from
http://www.speechworks.com/.
28. Silverman, H. F., and Morgan, D. P. (1990). The Application of Dynamic Programming
to Connected Speech Recognition. IEEE ASSP Magazine, 7(3):6-25.
29. Svendsen, T., Paliwal, K. K., Harborg, E., Husøy, P. O. (1989). Proc. ICASSP '89,
Glasgow.
30. Tiong, B. (1997). Speech Recognition. Retrieved December 10, 2004 from
http://murray.newcastle.edu.au/users/staff/speech/home_pages/tutorial_sr.html.
31. Tolba, H., and O'Shaughnessy, D. (2001). Speech Recognition by Intelligent Machines.
IEEE Canadian Review (38).
32. Warwick, C. (1997). What is the BNC? [Online]. Available from:
http://www.hcu.ox.ac.uk/BNC. Retrieved on 20-05-2005.
33. Webster's Dictionary (2004). Illiterate. Retrieved September 23, 2004 from
http://www.webster-dictionary.org/definition/illiterate.
34. Wilpon, J. G., DeMarco, D. M., Mikkilineni, R. P. (1988). Isolated Word Recognition over
the DDD Telephone Network: Results of Two Extensive Field Studies. Proc. ICASSP,
pp. 55-58.
35. Young, S., Evermann, G., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D.,
Povey, D., Valtchev, V., Woodland, P. (2002). The HTK Book. Retrieved April 1, 2005
from http://htk.eng.cam.ac.uk.
36. Zue, V., Cole, R., Ward, W. (1996). Speech Recognition. In Survey of the State of the Art
in Human Language Technology. Kauai, Hawaii, USA.
APPENDICES
Appendix A
Word Network
VERSION=1.0
N=15 L=24
I=0 W=!NULL
I=1 W=!NULL
I=2 W=SENT-START
I=3 W=RIMWE
I=4 W=!NULL
I=5 W=KABIRI
I=6 W=GATATU
I=7 W=KANE
I=8 W=GATANU
I=9 W=GATANDATU
I=10 W=KARINDWI
I=11 W=UMUNANI
I=12 W=ICYENDA
I=13 W=ZERO
I=14 W=SENT-END
J=0 S=14 E=1
J=1 S=0 E=2
J=2 S=2 E=3
J=3 S=3 E=4
J=4 S=5 E=4
J=5 S=6 E=4
J=6 S=7 E=4
J=7 S=8 E=4
J=8 S=9 E=4
J=9 S=10 E=4
J=10 S=11 E=4
J=11 S=12 E=4
J=12 S=13 E=4
J=13 S=2 E=5
J=14 S=2 E=6
J=15 S=2 E=7
J=16 S=2 E=8
J=17 S=2 E=9
J=18 S=2 E=10
J=19 S=2 E=11
J=20 S=2 E=12
J=21 S=2 E=13
J=22 S=2 E=14
J=23 S=4 E=14
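The listing above is in HTK's Standard Lattice Format (SLF): the header declares N nodes and L arcs, each `I=` line names the word at a node, and each `J=` line connects a start node (`S=`) to an end node (`E=`). A minimal sketch of a parser that sanity-checks such a network, assuming only whitespace-separated `KEY=VALUE` fields (it is not part of HTK itself):

```python
# Minimal parser for an HTK Standard Lattice Format (SLF) word network
# like the one in Appendix A. A sketch, not an HTK tool.

def parse_slf(text):
    nodes, arcs, header = {}, [], {}
    for line in text.splitlines():
        fields = dict(f.split("=", 1) for f in line.split())
        if "I" in fields:            # node line: I=<id> W=<word>
            nodes[int(fields["I"])] = fields["W"]
        elif "J" in fields:          # arc line: J=<id> S=<start> E=<end>
            arcs.append((int(fields["S"]), int(fields["E"])))
        else:                        # header lines: VERSION, N, L
            header.update(fields)
    # the declared node (N) and arc (L) counts must match what we parsed
    assert len(nodes) == int(header["N"])
    assert len(arcs) == int(header["L"])
    return nodes, arcs

# A cut-down network in the same format (the full network has N=15 L=24).
nodes, arcs = parse_slf("""VERSION=1.0
N=4 L=3
I=0 W=!NULL
I=1 W=SENT-START
I=2 W=RIMWE
I=3 W=SENT-END
J=0 S=0 E=1
J=1 S=1 E=2
J=2 S=2 E=3""")
```

Such a check catches the most common hand-editing error in a word network: a header count that no longer matches the node or arc list.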
Appendix B
Training Sentences
1.sil sil
2.sil gatatu sil
3.sil gatanu sil
4.sil gatanu sil
5.sil sil
6.sil karindwi sil
7.sil zero sil
8.sil umunani sil
9.sil gatanu sil
10.sil kane sil
11.sil icyenda sil
12.sil zero sil
13.sil icyenda sil
14.sil gatandatu sil
15.sil zero sil
16.sil sil
17.sil umunani sil
18.sil umunani sil
19.sil gatatu sil
20.sil gatandatu sil
21.sil karindwi sil
22.sil kane sil
23.sil karindwi sil
24.sil gatandatu sil
25.sil kane sil
26.sil gatanu sil
27.sil gatatu sil
28.sil zero sil
29.sil sil
30.sil sil
31.sil icyenda sil
32.sil kabiri sil
33.sil kabiri sil
34.sil gatanu sil
35.sil gatanu sil
36.sil icyenda sil
37.sil kabiri sil
38.sil kane sil
39.sil gatanu sil
40.sil gatanu sil
41.sil gatanu sil
42.sil icyenda sil
43.sil gatanu sil
44.sil rimwe sil
45.sil zero sil
46.sil sil
47.sil sil
48.sil kane sil
49.sil zero sil
50.sil gatandatu sil
Appendix C
Master label file
#!MLF!#
”data/train/rimwe01.lab”
RIMWE
.
”data/train/rimwe02.lab”
RIMWE
.
”data/train/rimwe03.lab”
RIMWE
.
”data/train/rimwe04.lab”
RIMWE
.
”data/train/rimwe05.lab”
RIMWE
.
”data/train/rimwe06.lab”
RIMWE
.
”data/train/rimwe07.lab”
RIMWE
.
”data/train/rimwe08.lab”
RIMWE
.
”data/train/rimwe09.lab”
RIMWE
.
”data/train/rimwe10.lab”
RIMWE
.
”data/train/rimwe11.lab”
RIMWE
.
”data/train/rimwe12.lab”
RIMWE
.
”data/train/rimwe13.lab”
RIMWE
.
”data/train/rimwe14.lab”
RIMWE
.
”data/train/rimwe15.lab”
RIMWE
.
”data/train/kabiri01.lab”
KABIRI
.
”data/train/kabiri02.lab”
KABIRI
.
”data/train/kabiri03.lab”
KABIRI
.
”data/train/kabiri04.lab”
KABIRI
.
”data/train/kabiri05.lab”
KABIRI
.
”data/train/kabiri06.lab”
KABIRI
.
”data/train/kabiri07.lab”
KABIRI
.
”data/train/kabiri08.lab”
KABIRI
.
”data/train/kabiri09.lab”
KABIRI
.
”data/train/kabiri10.lab”
KABIRI
.
”data/train/kabiri11.lab”
KABIRI
.
”data/train/kabiri12.lab”
KABIRI
.
”data/train/kabiri13.lab”
KABIRI
.
”data/train/kabiri14.lab”
KABIRI
.
”data/train/kabiri15.lab”
KABIRI
.
”data/train/gatatu01.lab”
GATATU
.
”data/train/gatatu02.lab”
GATATU
.
”data/train/gatatu03.lab”
GATATU
.
”data/train/gatatu04.lab”
GATATU
.
”data/train/gatatu05.lab”
GATATU
.
”data/train/gatatu06.lab”
GATATU
.
”data/train/gatatu07.lab”
GATATU
.
”data/train/gatatu08.lab”
GATATU
.
”data/train/gatatu09.lab”
GATATU
.
”data/train/gatatu10.lab”
GATATU
.
”data/train/gatatu11.lab”
GATATU
.
”data/train/gatatu12.lab”
GATATU
.
”data/train/gatatu13.lab”
GATATU
.
”data/train/gatatu14.lab”
GATATU
.
”data/train/gatatu15.lab”
GATATU
.
”data/train/kane01.lab”
KANE
.
”data/train/kane02.lab”
KANE
.
”data/train/kane03.lab”
KANE
.
”data/train/kane04.lab”
KANE
.
”data/train/kane05.lab”
KANE
.
”data/train/kane06.lab”
KANE
.
”data/train/kane07.lab”
KANE
.
”data/train/kane08.lab”
KANE
.
”data/train/kane09.lab”
KANE
.
”data/train/kane10.lab”
KANE
.
”data/train/kane11.lab”
KANE
.
”data/train/kane12.lab”
KANE
.
”data/train/kane13.lab”
KANE
.
”data/train/kane14.lab”
KANE
.
”data/train/kane15.lab”
KANE
.
”data/train/gatanu01.lab”
GATANU
.
”data/train/gatanu02.lab”
GATANU
.
”data/train/gatanu03.lab”
GATANU
.
”data/train/gatanu04.lab”
GATANU
.
”data/train/gatanu05.lab”
GATANU
.
”data/train/gatanu06.lab”
GATANU
.
”data/train/gatanu07.lab”
GATANU
.
”data/train/gatanu08.lab”
GATANU
.
”data/train/gatanu09.lab”
GATANU
.
”data/train/gatanu10.lab”
GATANU
.
”data/train/gatanu11.lab”
GATANU
.
”data/train/gatanu12.lab”
GATANU
.
”data/train/gatanu13.lab”
GATANU
.
”data/train/gatanu14.lab”
GATANU
.
”data/train/gatanu15.lab”
GATANU
.
”data/train/gatandatu01.lab”
GATANDATU
.
”data/train/gatandatu02.lab”
GATANDATU
.
”data/train/gatandatu03.lab”
GATANDATU
.
”data/train/gatandatu04.lab”
GATANDATU
.
”data/train/gatandatu05.lab”
GATANDATU
.
”data/train/gatandatu06.lab”
GATANDATU
.
”data/train/gatandatu07.lab”
GATANDATU
.
”data/train/gatandatu08.lab”
GATANDATU
.
”data/train/gatandatu09.lab”
GATANDATU
.
”data/train/gatandatu10.lab”
GATANDATU
.
”data/train/gatandatu11.lab”
GATANDATU
.
”data/train/gatandatu12.lab”
GATANDATU
.
”data/train/gatandatu13.lab”
GATANDATU
.
”data/train/gatandatu14.lab”
GATANDATU
.
”data/train/gatandatu15.lab”
GATANDATU
.
”data/train/karindwi01.lab”
KARINDWI
.
”data/train/karindwi02.lab”
KARINDWI
.
”data/train/karindwi03.lab”
KARINDWI
.
”data/train/karindwi04.lab”
KARINDWI
.
”data/train/karindwi05.lab”
KARINDWI
.
”data/train/karindwi06.lab”
KARINDWI
.
”data/train/karindwi07.lab”
KARINDWI
.
”data/train/karindwi08.lab”
KARINDWI
.
”data/train/karindwi09.lab”
KARINDWI
.
”data/train/karindwi10.lab”
KARINDWI
.
”data/train/karindwi11.lab”
KARINDWI
.
”data/train/karindwi12.lab”
KARINDWI
.
”data/train/karindwi13.lab”
KARINDWI
.
”data/train/karindwi14.lab”
KARINDWI
.
”data/train/karindwi15.lab”
KARINDWI
.
”data/train/umunani01.lab”
UMUNANI
.
”data/train/umunani02.lab”
UMUNANI
.
”data/train/umunani03.lab”
UMUNANI
.
”data/train/umunani04.lab”
UMUNANI
.
”data/train/umunani05.lab”
UMUNANI
.
”data/train/umunani06.lab”
UMUNANI
.
”data/train/umunani07.lab”
UMUNANI
.
”data/train/umunani08.lab”
UMUNANI
.
”data/train/umunani09.lab”
UMUNANI
.
”data/train/umunani10.lab”
UMUNANI
.
”data/train/umunani11.lab”
UMUNANI
.
”data/train/umunani12.lab”
UMUNANI
.
”data/train/umunani13.lab”
UMUNANI
.
”data/train/umunani14.lab”
UMUNANI
.
”data/train/umunani15.lab”
UMUNANI
.
”data/train/icyenda01.lab”
ICYENDA
.
”data/train/icyenda02.lab”
ICYENDA
.
”data/train/icyenda03.lab”
ICYENDA
.
”data/train/icyenda04.lab”
ICYENDA
.
”data/train/icyenda05.lab”
ICYENDA
.
”data/train/icyenda06.lab”
ICYENDA
.
”data/train/icyenda07.lab”
ICYENDA
.
”data/train/icyenda08.lab”
ICYENDA
.
”data/train/icyenda09.lab”
ICYENDA
.
”data/train/icyenda10.lab”
ICYENDA
.
”data/train/icyenda11.lab”
ICYENDA
.
”data/train/icyenda12.lab”
ICYENDA
.
”data/train/icyenda13.lab”
ICYENDA
.
”data/train/icyenda14.lab”
ICYENDA
.
”data/train/icyenda15.lab”
ICYENDA
.
”data/train/zero01.lab”
ZERO
.
”data/train/zero02.lab”
ZERO
.
”data/train/zero03.lab”
ZERO
.
”data/train/zero04.lab”
ZERO
.
”data/train/zero05.lab”
ZERO
.
”data/train/zero06.lab”
ZERO
.
”data/train/zero07.lab”
ZERO
.
”data/train/zero08.lab”
ZERO
.
”data/train/zero09.lab”
ZERO
.
”data/train/zero10.lab”
ZERO
.
”data/train/zero11.lab”
ZERO
.
”data/train/zero12.lab”
ZERO
.
”data/train/zero13.lab”
ZERO
.
”data/train/zero14.lab”
ZERO
.
”data/train/zero15.lab”
ZERO
.
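The structure of this master label file is regular: a `#!MLF!#` header, then for each utterance a quoted `.lab` path, the single word label, and a terminating `.` line. Under the assumption of the appendix's naming scheme (`data/train/<word><nn>.lab`, 15 utterances per word), it can be regenerated with a small helper; the function below is an illustrative sketch, not an HTK tool.

```python
# Hypothetical helper that regenerates a word-level master label file
# (MLF) of the kind listed in Appendix C. The directory layout and
# 15-utterances-per-word convention follow the appendix; the helper
# itself is an illustrative assumption.

def make_mlf(words, count=15, directory="data/train"):
    lines = ["#!MLF!#"]                      # MLF header, written once
    for word in words:
        for i in range(1, count + 1):
            lines.append('"%s/%s%02d.lab"' % (directory, word, i))
            lines.append(word.upper())       # the single word label
            lines.append(".")                # "." closes each label list
    return "\n".join(lines)

mlf = make_mlf(["rimwe"])
```

Generating the file this way avoids the copy-paste slips that hand-built label files are prone to, such as duplicated or missing entries.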
Appendix D
Training Data
data/MFC/rimwe01.MFC
data/MFC/rimwe02.MFC
data/MFC/rimwe03.MFC
data/MFC/rimwe04.MFC
data/MFC/rimwe05.MFC
data/MFC/rimwe06.MFC
data/MFC/rimwe07.MFC
data/MFC/rimwe08.MFC
data/MFC/rimwe09.MFC
data/MFC/rimwe10.MFC
data/MFC/rimwe11.MFC
data/MFC/rimwe12.MFC
data/MFC/rimwe13.MFC
data/MFC/rimwe14.MFC
data/MFC/rimwe15.MFC
data/MFC/kabiri01.MFC
data/MFC/kabiri02.MFC
data/MFC/kabiri03.MFC
data/MFC/kabiri04.MFC
data/MFC/kabiri05.MFC
data/MFC/kabiri06.MFC
data/MFC/kabiri07.MFC
data/MFC/kabiri08.MFC
data/MFC/kabiri09.MFC
data/MFC/kabiri10.MFC
data/MFC/kabiri11.MFC
data/MFC/kabiri12.MFC
data/MFC/kabiri13.MFC
data/MFC/kabiri14.MFC
data/MFC/kabiri15.MFC
data/MFC/gatatu01.MFC
data/MFC/gatatu02.MFC
data/MFC/gatatu03.MFC
data/MFC/gatatu04.MFC
data/MFC/gatatu05.MFC
data/MFC/gatatu06.MFC
data/MFC/gatatu07.MFC
data/MFC/gatatu08.MFC
data/MFC/gatatu09.MFC
data/MFC/gatatu10.MFC
data/MFC/gatatu11.MFC
data/MFC/gatatu12.MFC
data/MFC/gatatu13.MFC
data/MFC/gatatu14.MFC
data/MFC/gatatu15.MFC
data/MFC/kane01.MFC
data/MFC/kane02.MFC
data/MFC/kane03.MFC
data/MFC/kane04.MFC
data/MFC/kane05.MFC
data/MFC/kane06.MFC
data/MFC/kane07.MFC
data/MFC/kane08.MFC
data/MFC/kane09.MFC
data/MFC/kane10.MFC
data/MFC/kane11.MFC
data/MFC/kane12.MFC
data/MFC/kane13.MFC
data/MFC/kane14.MFC
data/MFC/kane15.MFC
data/MFC/gatanu01.MFC
data/MFC/gatanu02.MFC
data/MFC/gatanu03.MFC
data/MFC/gatanu04.MFC
data/MFC/gatanu05.MFC
data/MFC/gatanu06.MFC
data/MFC/gatanu07.MFC
data/MFC/gatanu08.MFC
data/MFC/gatanu09.MFC
data/MFC/gatanu10.MFC
data/MFC/gatanu11.MFC
data/MFC/gatanu12.MFC
data/MFC/gatanu13.MFC
data/MFC/gatanu14.MFC
data/MFC/gatanu15.MFC
data/MFC/gatandatu01.MFC
data/MFC/gatandatu02.MFC
data/MFC/gatandatu03.MFC
data/MFC/gatandatu04.MFC
data/MFC/gatandatu05.MFC
data/MFC/gatandatu06.MFC
data/MFC/gatandatu07.MFC
data/MFC/gatandatu08.MFC
data/MFC/gatandatu09.MFC
data/MFC/gatandatu10.MFC
data/MFC/gatandatu11.MFC
data/MFC/gatandatu12.MFC
data/MFC/gatandatu13.MFC
data/MFC/gatandatu14.MFC
data/MFC/gatandatu15.MFC
data/MFC/karindwi01.MFC
data/MFC/karindwi02.MFC
data/MFC/karindwi03.MFC
data/MFC/karindwi04.MFC
data/MFC/karindwi05.MFC
data/MFC/karindwi06.MFC
data/MFC/karindwi07.MFC
data/MFC/karindwi08.MFC
data/MFC/karindwi09.MFC
data/MFC/karindwi10.MFC
data/MFC/karindwi11.MFC
data/MFC/karindwi12.MFC
data/MFC/karindwi13.MFC
data/MFC/karindwi14.MFC
data/MFC/karindwi15.MFC
data/MFC/umunani01.MFC
data/MFC/umunani02.MFC
data/MFC/umunani03.MFC
data/MFC/umunani04.MFC
data/MFC/umunani05.MFC
data/MFC/umunani06.MFC
data/MFC/umunani07.MFC
data/MFC/umunani08.MFC
data/MFC/umunani09.MFC
data/MFC/umunani10.MFC
data/MFC/umunani11.MFC
data/MFC/umunani12.MFC
data/MFC/umunani13.MFC
data/MFC/umunani14.MFC
data/MFC/umunani15.MFC
data/MFC/icyenda01.MFC
data/MFC/icyenda02.MFC
data/MFC/icyenda03.MFC
data/MFC/icyenda04.MFC
data/MFC/icyenda05.MFC
data/MFC/icyenda06.MFC
data/MFC/icyenda07.MFC
data/MFC/icyenda08.MFC
data/MFC/icyenda09.MFC
data/MFC/icyenda10.MFC
data/MFC/icyenda11.MFC
data/MFC/icyenda12.MFC
data/MFC/icyenda13.MFC
data/MFC/icyenda14.MFC
data/MFC/icyenda15.MFC
data/MFC/zero01.MFC
data/MFC/zero02.MFC
data/MFC/zero03.MFC
data/MFC/zero04.MFC
data/MFC/zero05.MFC
data/MFC/zero06.MFC
data/MFC/zero07.MFC
data/MFC/zero08.MFC
data/MFC/zero09.MFC
data/MFC/zero10.MFC
data/MFC/zero11.MFC
data/MFC/zero12.MFC
data/MFC/zero13.MFC
data/MFC/zero14.MFC
data/MFC/zero15.MFC
Appendix E
Hidden Markov Model Definitions (HMMDEFS)
~o
<STREAMINFO> 1 39
<VECSIZE> 39<NULLD><MFCC_0_D_A><DIAGC>
~h"zero"
<BEGINHMM>
<NUMSTATES> 6
<STATE> 2
<MEAN> 39
-1.538187e+001 1.141508e+001 -3.588139e+000 -1.159882e+000 -1.452020e+000 -8.341283e+000
<VARIANCE> 39
3.046115e+001 3.921619e+001 1.723766e+001 2.001421e+001 3.992482e+001 3.596347e+001 2.784846e+001
<GCONST> 1.137821e+002
<STATE> 3
<MEAN> 39
-1.491195e+000 -6.492606e+000 -1.891563e-001 -6.878118e+000 -6.327397e+000 -1.235269e+001
<VARIANCE> 39
2.520783e+000 8.964164e+000 5.252084e+000 8.973154e+000 5.499793e+000 1.332134e+001 2.178135e+001
<GCONST> 9.035600e+001
<STATE> 4
<MEAN> 39
-9.309770e+000 -9.457813e+000 -2.599780e+000 -1.757934e+001 -1.275383e+001 -1.126780e+001
<VARIANCE> 39
6.970012e+001 2.225276e+001 4.992588e+001 4.126175e+001 2.610523e+001 7.679757e+001 6.116331e+001
<GCONST> 1.238130e+002
<STATE> 5
<MEAN> 39
-2.297705e-001 -4.164129e-002 -1.899639e+000 -9.609221e+000 -5.382258e+000 -1.236597e+000
<VARIANCE> 39
8.854380e+000 7.536385e+000 1.740920e+001 4.921722e+001 3.659902e+001 1.955439e+001 4.722785e+001
<GCONST> 1.009971e+002
<TRANSP> 6
0.000000e+000 1.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000
0.000000e+000 9.060647e-001 6.262358e-002 3.131179e-002 0.000000e+000 0.000000e+000
0.000000e+000 0.000000e+000 9.430364e-001 5.696366e-002 0.000000e+000 0.000000e+000
0.000000e+000 0.000000e+000 0.000000e+000 9.249576e-001 7.504237e-002 0.000000e+000
0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 8.498170e-001 1.501830e-001
0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000
<ENDHMM>
~h"rimwe"
<BEGINHMM>
<NUMSTATES> 6
<STATE> 2
<MEAN> 39
-7.970719e+000 -7.500427e+000 1.132444e+001 -1.286703e+001 -7.432399e+000 -1.751952e+001
<VARIANCE> 39
3.849774e+001 2.229306e+001 4.268061e+001 8.451440e+001 3.439103e+001 4.802508e+001 5.185015e+001
<GCONST> 1.317715e+002
<STATE> 3
<MEAN> 39
-2.377380e+000 -3.663290e+000 4.965676e+000 -1.033556e+001 -7.324887e+000 -1.087329e+001
<VARIANCE> 39
7.865341e+000 1.011867e+001 4.059527e+000 3.692888e+001 2.392439e+001 1.843463e+001 3.225859e+001
<GCONST> 1.058150e+002
<STATE> 4
<MEAN> 39
-7.165953e+000 -6.947466e+000 6.544258e+000 -1.652563e+001 -9.213765e+000 -1.855777e+001
<VARIANCE> 39
3.759945e+001 6.370345e+000 9.036909e+000 1.956501e+002 2.907838e+001 4.600018e+001 2.415433e+001
<GCONST> 1.079340e+002
<STATE> 5
<MEAN> 39
-6.314114e+000 -4.532432e+000 7.106805e+000 -7.048369e+000 -8.000018e+000 -1.071996e+001
<VARIANCE> 39
2.019156e+001 3.633694e+001 1.606951e+001 1.441847e+002 5.447787e+001 4.976671e+001 2.165058e+001
<GCONST> 1.010713e+002
<TRANSP> 6
0.000000e+000 9.333376e-001 6.666239e-002 0.000000e+000 0.000000e+000 0.000000e+000
0.000000e+000 9.137994e-001 8.620062e-002 0.000000e+000 0.000000e+000 0.000000e+000
0.000000e+000 0.000000e+000 8.917421e-001 1.082579e-001 0.000000e+000 0.000000e+000
0.000000e+000 0.000000e+000 0.000000e+000 8.244619e-001 1.755382e-001 0.000000e+000
0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 9.221094e-001 7.789055e-002
0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000
<ENDHMM>
~h"kabiri"
<BEGINHMM>
<NUMSTATES> 6
<STATE> 2
<MEAN> 39
-7.888013e+000 -1.187933e+001 -2.681767e+000 -1.294684e+001 1.227886e+000 -9.493131e+000
<VARIANCE> 39
4.862306e+001 1.635481e+001 3.838939e+001 3.666690e+001 4.893019e+001 3.454409e+001 4.188983e+001
<GCONST> 1.329512e+002
<STATE> 3
<MEAN> 39
-1.659296e+000 -7.659583e+000 8.000670e+000 -1.012939e+000 -4.288243e+000 -1.464354e+001
<VARIANCE> 39
8.231427e+000 6.313235e+000 6.235238e+001 3.491607e+001 1.448567e+001 1.007993e+001 1.463966e+001
<GCONST> 9.725203e+001
<STATE> 4
<MEAN> 39
-7.019081e+000 -4.749569e+000 1.820031e+001 -9.001300e+000 -8.113852e+000 -1.175638e+001
<VARIANCE> 39
9.790601e+000 2.004658e+001 1.836600e+001 3.005601e+001 2.896993e+001 3.778489e+001 1.294033e+001
<GCONST> 1.157377e+002
<STATE> 5
<MEAN> 39
-1.058715e+001 -7.019902e-001 1.127821e+001 -2.145595e+001 -6.475991e+000 -1.326184e+001
<VARIANCE> 39
4.183236e+001 3.304543e+001 2.711980e+001 1.644754e+002 4.248949e+001 5.100737e+001 2.856202e+001
<GCONST> 1.120947e+002
<TRANSP> 6
0.000000e+000 1.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000
0.000000e+000 9.191111e-001 8.088891e-002 0.000000e+000 0.000000e+000 0.000000e+000
0.000000e+000 0.000000e+000 8.458276e-001 1.130628e-001 4.110956e-002 0.000000e+000
0.000000e+000 0.000000e+000 0.000000e+000 9.421099e-001 5.789007e-002 0.000000e+000
0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 9.339923e-001 6.600768e-002
0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000
<ENDHMM>
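In these definitions each emitting state stores a 39-dimensional <MEAN> and diagonal <VARIANCE> (the vectors are truncated in this listing) together with a precomputed <GCONST>. For a diagonal-covariance Gaussian with variances v, GCONST = n*log(2*pi) + sum(log v_i), so the state's log output probability is log b(x) = -0.5 * (GCONST + sum((x_i - mu_i)^2 / v_i)). A small sketch of this relationship, with synthetic 2-dimensional numbers rather than values from the model file:

```python
import math

# Sketch of how HTK's stored GCONST enters the diagonal-covariance
# Gaussian output density. The test values are synthetic, not taken
# from the HMMDEFS above.

def gconst(var):
    # GCONST = n*log(2*pi) + sum(log v_i)
    return len(var) * math.log(2 * math.pi) + sum(math.log(v) for v in var)

def log_output_prob(x, mean, var):
    # log b(x) = -0.5 * (GCONST + squared Mahalanobis distance)
    mahalanobis = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (gconst(var) + mahalanobis)
```

Precomputing GCONST at training time is a speed optimisation: the constant part of the density is evaluated once per state instead of once per observation frame.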