Arthur Kunkle

ECE 5525

LVSR System Design


Contents

Introduction and Motivation
Technology Overview
System Requirements
System Design
Data Preparation
Acoustic Model Training
Language Model Training
Model Testing
Milestones
Open Issues and Questions
References



Introduction and Motivation

A Large Vocabulary Speech Recognition (LVSR) system converts speech data into textual transcriptions. This system will serve as a test-bed for the development of new speech recognition technologies.

This design document assumes basic knowledge of the tasks an LVSR must accomplish, as well as some in-depth knowledge of the HTK framework.

Technology Overview

The following major technologies will be used to develop the LVSR system:

1. HMM Toolkit (HTK)

The HTK framework was originally developed in 1989 by the Speech Vision and Robotics Group of Cambridge University. HTK provides a rich toolset of utilities that perform many functions related to speech recognition. HTK is a Unix-native application and is open-source for research and development purposes.


2.

Cygwin UNIX Emulation Environment


Cygwin
is a collection of tools originally developed by Cygnus Solutions to provide in Microsoft
Windows a command line and programming interface familiar to Unix users.

(
2
)


The HTK framework is natively built using GNU tools such as configure, make, and gcc. Cygwin
allows these tools to run easily on a Windows Server platform. Cygwin also has native support
for Perl, the scripting language of choice for this system.


Using Cygwin will also allow easy portability to other Unix/Linux based platforms for parallel
processing.

The only porting requirements are t
he HTK framework would need to be rebuilt and
Perl must be installed.


3. Practical Extraction and Reporting Language (Perl)

In computer programming, Perl is a high-level, general-purpose, interpreted, dynamic programming language. The language provides powerful text processing facilities without the arbitrary data-length limits of many contemporary Unix shell programs. (1)

Perl also provides a repository of open-source, community-developed modules called CPAN (4). The following modules are some that may be utilized in the LVSR system:




- File::Headerinfo::WAV - an extractor of useful information from WAV files.
- File::Find - traverse a directory tree.
- Spreadsheet::WriteExcel - write to a cross-platform Excel binary file.
- Config::Simple - simple configuration file class.
- Win32::TaskScheduler - Perl extension for managing Win32 jobs scheduled via Task Scheduler.
- Math::Matlab::Local - interface to a local Matlab process.
- Email::Send - simply sending email.
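As a brief illustration of how these modules will be used by the data preparation scripts, the following minimal sketch collects the WAV files under a corpus directory with the core File::Find module (the corpus root is illustrative; real runs would read it from a configuration file):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Corpus root; a placeholder matching the example configuration later on.
my $corpus_root = 'F:/CORPORA/TIMIT';

# Walk the corpus tree and remember every .wav file found.
my @wav_files;
find(sub { push @wav_files, $File::Find::name if /\.wav$/i }, $corpus_root);

print "$_\n" for @wav_files;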



4. Subversion Configuration Management Tool

Subversion (SVN) is a version control system initiated in 2000 by CollabNet Inc. It is used to maintain current and historical versions of files such as source code, web pages, and documentation. (3)

Subversion will be used to control both the LVSR scripts developed and a snapshot of the HTK baseline itself. The snapshot will be used when modifications to HTK tools are required. Documentation such as this design document will also be stored in the repository.

The following is a proposed directory hierarchy:






System Requirements


The LVSR system is characterized by the following major requirements. The LVSR shall:

1. Be capable of incorporating prepared data that conforms to a standard HTK interface (defined in "System Design").

2. Automatically generate language and acoustic models from all available conforming input data.

3. Be configurable to use multiple processors and/or remote computers to share the workload for model re-estimation and testing.

4. Have a scheduling mechanism to run different configuration profiles and create a separate results directory for each, containing the acoustic and language models.

5. Record all HTK tool output for a "run" in time-stamped log files.

6. Merge language models together and determine the optimum weighting for the models by measuring model perplexity.

7. Email a list of users information regarding run errors and completion status (a sketch of this step follows the list).
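For requirement 7, a minimal notification sketch using the Email::Send and Email::Simple modules named in the technology overview; every address and the SMTP host here are placeholders:

use strict;
use warnings;
use Email::Send;
use Email::Simple;
use Email::Simple::Creator;

# Build a simple status message; all addresses are placeholders.
my $message = Email::Simple->create(
    header => [
        From    => 'lvsr@example.com',
        To      => 'user@example.com',
        Subject => 'LVSR run complete',
    ],
    body => "Run finished. See the time-stamped log files for details.\n",
);

# Hand the message to a mailer; the SMTP host is an assumption.
my $sender = Email::Send->new({ mailer => 'SMTP' });
$sender->mailer_args([ Host => 'smtp.example.com' ]);
$sender->send($message);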

System Design

The LVSR system is broken down into the following major components:

1. Data Preparation, Phase 1 - This step takes data that may be in a completely custom format and processes it to comply with a "standard interface". Data is processed in accordance with configuration files.

2. Data Preparation, Phase 2 - This step merges all data available from (1) into the final data set that will be used to generate HTK models. This step also uses "global" data such as pronunciation dictionaries.

3. Acoustic Model Training - This step uses the prepared corpus data to generate HMM acoustic models.

4. Language Model Training - This step will create specific language models for the different prepared data sources, and then combine them into an overall LM.

5. Model Testing - This step uses the models from (3) and (4) to test model "goodness" against a chosen subset of the data (known as TEST data).

The following standard directory layout will be used:







Data Preparation

Data to be used in the LVSR comes from many different sources and in many different standards. HTK works most effectively if the data is in the same format before models are generated. This creates a need for a standard data "interface" between an arbitrary corpus and the training process itself. This section proposes the structure that each source should follow prior to any model generation.

Data preparation should be handled in two phases. First, each corpus should be processed (probably in a unique way) to provide items that characterize its specific data set. The next phase of data preparation will be to combine all of the contributing sources' data into a single area ready for model generation. This step will handle dictionary and list merging, grammar generation, etc. In this phase, outside dictionaries may be included as well to supplement those provided by the corpora.

Phase 1: Corpus-specific Artifacts Needed by HTK

Prior to starting model generation, HTK needs the following items that are custom to each corpus:

1. (OPTIONAL) Dictionary - The list of all words found in both testing and training files in the corpus and their phonetic pronunciations. Should be "<corpus_name>_dict.txt".

2. List Files
a. Word List - A list of all unique words found in the transcriptions. "<corpus_name>_word_list.txt"
b. Training Data List - A list of all MFCC data files contributed by the source, using their absolute location on disk. Rename all utterance files to be "<corpus_name>_<speaker>_<num>.mfcc".

3. MLF Files - Master Label Files that are used heavily by HTK training.
a. "Plain" MLFs - These only include the words of each utterance (see the example after this list). Always create these regardless of timing info availability.
i. Word MLF - <corpus_name>_word_mlf.txt
b. "Timed" MLFs - (OPTIONAL) These include the time boundaries of the appearing words/phones. They must be converted to HTK timing as well (HTK uses time units of 100 ns).
i. Word MLF - <corpus>_word_mlf_timed.txt
ii. Phone MLF - <corpus>_phone_mlf_timed.txt

4. Audio Data - Convert WAV/NIST/Sphere format into MFCC using common parameters. Make sure that the maximum file length of HTK is observed, splitting as necessary. Use HCopy to perform the conversion (a sketch of the coding parameters appears after the example configuration below). Create a map file with the original file locations mapped to the newly created names below:
a. MFCCs - "<corpus>_<speaker (NULL if none)>_<uniq_id>.mfcc"
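For reference, a word-level MLF is a plain-text file in HTK's standard format. A minimal sketch with a hypothetical TIMIT utterance (the "#!MLF!#" header and the terminating period are required by HTK):

#!MLF!#
"*/TIMIT_dr1-fcjf0_0001.lab"
SHE
HAD
YOUR
DARK
SUIT
.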

In addition to the data and a custom Perl script to handle each source, common configuration information will be needed to create the correct features, etc., so that the data is uniform across all corpora. The following configuration file is an example:

# Corpus location on disk
Location: F:/CORPORA/TIMIT

# Sound-splitting threshold (in HTK units)
UtteranceSplit: 30

# Coding parameter config reference
CodingConfigFile: standard_mfcc_cfg.txt
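The referenced standard_mfcc_cfg.txt holds the HCopy coding parameters shared by every corpus. A plausible sketch using typical values from the HTK Book tutorial; the exact settings for this system are still to be chosen:

# standard_mfcc_cfg.txt - common HCopy coding parameters (values illustrative)
SOURCEFORMAT = NIST
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12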




Phase 2: Common Items for Data Merging

With each corpus having been prepared in the previous step, the data must be merged together. Also, "global" data such as dictionaries should be added here and merged with any dictionaries contributed by individual corpora.

1. Dictionary - The list of all words found in all contributed files and their phonetic pronunciations. If a word's pronunciation is not found, errors should be generated and execution shall halt. Should be "dict.txt". This should be generated by the following means:
a. If phone-level transcriptions are available, map these to word-level transcriptions and generate the dictionary.
b. Use the dictionary provided with each corpus.
c. Each entry should have "sp" at the end.
d. Add "noise" words such as "lipsmack" and "cough".
e. Use the HTK utility HDMan to merge dictionaries (see the example invocation after this list).

2. Indexed Data Files - All the files from individual sources will be merged into a common area and their filenames will be transformed to a common naming scheme. Index files will map the new names back to the originals.

3. Final List Files
a. Word List - A list of all unique words found in the dictionary. This is expanded beyond just the transcriptions because new words may be encountered in model testing. "word_list.txt"
b. Training Data List - A list of all MFCC training data files, using their absolute location on disk. Rename all utterance files to be "<corpus_name>_<speaker>_<num>.mfcc".
c. Testing Data List - A list of all MFCC testing data files, using their absolute location on disk. Rename all utterance files to be "<corpus_name>_<speaker>_<num>.mfcc".

4. MLF Files - Master Label Files that are used heavily by HTK training.
a. "Plain" MLFs - These only include the words of each utterance. Always create these regardless of timing info availability.
i. Word MLF - word_mlf.txt
ii. Phone MLF - phone_mlf.txt (generated using HLEd)
b. "Timed" MLFs - (OPTIONAL) These include the time boundaries of the appearing words/phones. They must be converted to HTK timing as well (HTK uses time units of 100 ns).
i. Word MLF - <corpus>_word_mlf_timed.txt
ii. Phone MLF - <corpus>_phone_mlf_timed.txt

5. Transcription Files - Transcription files that are formatted for direct use by the language modeling process. They are generated from the previously created MLF files.

6. Grammar File - By default, this step will generate an "open" grammar from the word list: any word can legally follow any other word in the final word list. This is used to test acoustic models only.
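As referenced in item 1e, dictionary merging can be driven by HDMan together with a global edit script. A hedged sketch in which all file names are placeholders; "AS sp" appends the short-pause phone to every pronunciation (item 1c) and "MP sil sil sp" merges trailing silence sequences. The edit script (global.ded) might contain:

AS sp
MP sil sil sp

and the merge command, which combines the per-corpus dictionaries into dict.txt, restricted to word_list.txt, while writing the phone list and a log:

HDMan -m -g global.ded -w word_list.txt -n phone_list.txt -l dman.log dict.txt timit_dict.txt ami_dict.txt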

The following configuration file is an example for this phase:

# Phone-set information
PhoneSet: TIMIT

# Coding parameter config reference
CodingConfigFile: standard_mfcc_cfg.txt

# Parameters to determine percentage of input data that is TRAIN/TEST
# must add to 100
TrainDataPercent: 80
TestDataPercent: 20
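Profiles in this "Key: value" form can be read directly with the Config::Simple module listed in the technology overview, which understands this HTTP-like syntax. A minimal sketch (the profile file name is hypothetical):

use strict;
use warnings;
use Config::Simple;

# Load the Phase 2 profile; Config::Simple parses "Key: value" lines.
my $cfg = Config::Simple->new('phase2_cfg.txt')
    or die Config::Simple->error();

my $train_pct = $cfg->param('TrainDataPercent');
my $test_pct  = $cfg->param('TestDataPercent');

# Enforce the "must add to 100" rule from the profile comments.
die "TRAIN/TEST split must total 100\n"
    unless $train_pct + $test_pct == 100;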




Acoustic Model Training

The Acoustic Model generation phase will generate multiple versions of HMM definition files that model the input utterances at the phone and tri-phone level. The following major events occur (a sketch mapping them to HTK tools follows the list):

1. Prototype HMM is created.

2. Create the first HMM model for all phones.

3. Tie the states for the silence model.

4. Re-align the models to use all word pronunciations.

5. Create tri-phone HMM models.

6. Use decision-tree-based clustering to tie tri-phone model parameters.

7. Split the Gaussian mixtures used for each state.
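Each step maps onto a standard HTK tool. The outline below is a sketch of the usual flat-start sequence from the HTK Book; the file names are illustrative rather than final:

1. HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto (flat-start prototype; -f applies the VarianceFloor value)
2. Repeated HERest re-estimation passes (ReestimationCount per major step)
3. HHEd with an edit script (e.g., sil.hed) to add silence transitions and tie the sp/sil states
4. HVite -a forced alignment to select among alternative word pronunciations
5. HLEd with the TC command to convert the labels to tri-phones
6. HHEd with TB commands (the TreeEditFile below) for tied-state tree clustering
7. HHEd with the MU command to split the Gaussian mixtures stepwise up to the configured target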

The following configuration will be used:

#Acoustic Training Configuration Profiles
ProfileName: Basic

#settings for pruning and floor values
VarianceFloor: 0.01
PruningThresholds: 250.0 150.0 1000.0
RealignPruneThreshold: 250.0

#which corpus contains bootstrap data for iteration 1
BootstrapCorpus: TIMIT

#how many calls to HERest to make in between major AM steps
ReestimationCount: 2

#file for tree-based clustering logic
TreeEditFile: basic_tree.hed

#determine target mixtures to apply at the end of training
GaussianMixtures: 8
MixtureStepSize: 2


Language Model Training

This phase of development will create an n-gram language model that predicts a symbol in a sequence given its n-1 predecessors. It is built on the assumption that the probability of a specific n-gram can be estimated from its frequency of occurrence in the training text. This is performed with the following workflow: (5)

1. Training text is scanned, and n-grams are counted and stored in grammar files.

2. Words are mapped to an "out-of-vocabulary" class. Other class mappings are applied for class-based language models.

3. The counts in the resulting grammar files are used to compute n-gram probabilities, which are stored in the language model files.

4. The goodness of the language model is measured by calculating perplexity against testing text from the corpus (see the sketch after this list).
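Perplexity here is the standard measure PP = exp(-(1/N) * sum of ln p(w_i | history)), i.e., the average branching factor the model leaves for the test text; lower is better. The HTK LM tools report it directly, but a minimal Perl sketch of the computation from per-word log probabilities:

use strict;
use warnings;
use List::Util qw(sum);

# @logp holds the natural-log probability the LM assigns to each test word.
sub perplexity {
    my @logp = @_;
    return exp(-sum(@logp) / scalar(@logp));
}

# Example: a model assigning p = 0.1 to each of four words gives PP = 10.
printf "%.2f\n", perplexity((log(0.1)) x 4);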

This results in a language model for a specific data source (corpus). These will then be carefully interpolated together to form a language model spanning multiple sources.

The following configuration profile will be used to govern the language model generation process:

#these settings dictate the Language Model generation process for all sources
MaxNewWords: 100000
NGramBufferSize: 200000

#will generate up to N-gram models
NToGenerate: 4

FoFLevels: 32

#must include N-1 cutoff values
Cutoffs: 1, 2, 3

#how much this LM should contribute to the overall model
OverallContribution: 0.5

#class-model configuration items
ClassAmount: 150
ClusterIterations: 1
ClassContribution: 0.7



Model Testing

The final phase of the system will test the acoustic and language models generated to this point. The results will be cataloged according to the timestamp and the profile name.

1. Recognition using acoustic models only and the "open" grammar (i.e., no LM applied).

2. Recognition using both AM and LM.

# standard HMM/LM testing parameters
WordInsertionPenalty: 0.0
GrammarScaleFactor: 5.0
HMMNumbersToTest: 19
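In HTK terms, WordInsertionPenalty and GrammarScaleFactor correspond to HVite's -p and -s options, with HResults scoring the recognizer output against the reference transcriptions. A sketch of the two invocations, with the network, dictionary, and list file names as placeholders:

HVite -H hmmdefs -S test.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict.txt tiedlist
HResults -I test_ref.mlf tiedlist recout.mlf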


Milestones

The following actions are given in order with the time estimates for each:

1. TIMIT Data Prep: 6 hours
2. AMI Data Prep: 10 hours
3. Phase 2 Data Prep Sub-System: 20 hours
4. Acoustic Model Sub-System: 20 hours
5. Model Testing Sub-System: 12 hours
6. Language Model Sub-System: 15 hours
7. RTE '06 Data Prep: 14 hours
8. Scheduling / Reporting: 14 hours
9. Extra Features / Refactoring: 16 hours
10. Profile Authoring: 4 hours

Total Effort Estimate: 131 hours


Open Issues and Questions


1. Can acoustic and language model generation be run in parallel after a common data preparation workflow?

2. Right now, all data input into the LVSR is tagged as training data. What is the best way to choose a subset of data for testing only? Have a percentage configuration value and pick random utterances? Have a configurable list of specific utterances set aside? If a source (corpus) specifies a testing set, should we use it by default?

3. Which workflow makes more sense for multiple-source LM generation:
a. Generate a source-specific word-level LM, generate a source-specific class-level LM, and interpolate them together. Then combine with other source-specific LMs.
b. Use all training text to create a single word-level LM, generate a class-level LM, then combine into the final LM.

References


1. "Perl." Wikipedia. 7 Dec. 2008 <http://en.wikipedia.org/wiki/perl>.

2. "Cygwin." Wikipedia. 7 Dec. 2008 <http://en.wikipedia.org/wiki/cygwin>.

3. "Subversion." Wikipedia. 7 Dec. 2008 <http://en.wikipedia.org/wiki/subversion>.

4. "Comprehensive Perl Archive Network." Comprehensive Perl Archive Network. 7 Dec. 2008 <http://www.cpan.org/>.

5. Young, Steve, et al. The HTK Book. 1995.