HTK Tutorial


K.Marasek

05.07.2005

Multimedia Department

Prepared using the HTK Book

Software architecture

- toolkit for Hidden Markov Modelling
- optimized for Speech Recognition
- very flexible and complete
- very good documentation (the HTK Book)
- Data Preparation Tools
- Training Tools
- Recognition Tools
- Analysis Tool



General concepts

- Set of programs with a command-line style interface.
- Each tool has a number of required arguments plus optional arguments. The latter are always prefixed by a minus sign:

  HFoo -T 1 -f 34.3 -a -s myfile file1 file2

- Options whose names are a capital letter have the same meaning across all tools. For example, the -T option is always used to control the trace output of an HTK tool.
- In addition to command line arguments, the operation of a tool can be controlled by parameters stored in a configuration file. For example, if the command

  HFoo -C config -f 34.3 -a -s myfile file1 file2

  is executed, the tool HFoo will load the parameters stored in the configuration file config during its initialisation procedures.



The HTK data formats

- audio: many common formats plus HTK binary
- features: HTK binary
- labels: HTK text (single files or Master Label Files)
- models: HTK text or binary (single files or Master Macro Files)
- other: HTK text
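The HTK binary feature format is simple enough to read and write directly. Below is a minimal Python sketch of the 12-byte header and frame layout as documented in the HTK Book; the function and file names are illustrative, not part of HTK:

```python
import struct

# HTK binary parameter files start with a 12-byte big-endian header:
#   int32 nSamples   - number of feature vectors (frames)
#   int32 sampPeriod - frame period in 100 ns units (100000 = 10 ms)
#   int16 sampSize   - bytes per vector (4 bytes per float coefficient)
#   int16 parmKind   - base parameter kind (6 = MFCC) plus qualifier bits
PARM_MFCC = 6
Q_D, Q_A, Q_0 = 0o400, 0o1000, 0o20000   # _D, _A, _0 qualifier bits

def write_htk_feat(path, frames, samp_period=100000,
                   parm_kind=PARM_MFCC | Q_D | Q_A | Q_0):
    """Write a list of equal-length float vectors as an HTK feature file."""
    n_dim = len(frames[0])
    with open(path, "wb") as f:
        f.write(struct.pack(">iihh", len(frames), samp_period,
                            4 * n_dim, parm_kind))
        for vec in frames:
            f.write(struct.pack(">%df" % n_dim, *vec))

def read_htk_header(path):
    """Return the header fields of an HTK feature file as a dict."""
    with open(path, "rb") as f:
        n, period, size, kind = struct.unpack(">iihh", f.read(12))
    return {"nSamples": n, "sampPeriod": period,
            "sampSize": size, "parmKind": kind}
```

HList displays the same header fields, so this is handy for sanity-checking files produced by HCopy.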



Data preparation tools

data manipulation tools:

- HCopy - parameterize signals
- HQuant - vector quantization
- HLEd - label editor
- HHEd - model editor (master model file)
- HDMan - dictionary editor
- HBuild - language model conversion
- HParse - lattice file preparation (grammar conversion)

data visualization tools:

- HSLab - speech label manipulation
- HList - data display and manipulation
- HSGen - generate sentences out of a regular grammar



Training tools

The actual training process takes place in stages, as illustrated in more detail in Fig. 2.3 of the HTK Book. Firstly, an initial set of models must be created. If there is some speech data available for which the location of the sub-word (i.e. phone) boundaries has been marked, then this can be used as bootstrap data. In this case, the tools HInit and HRest provide isolated word style training using the fully labelled bootstrap data. Each of the required HMMs is generated individually. HInit reads in all of the bootstrap training data and cuts out all of the examples of the required phone. It then iteratively computes an initial set of parameter values using a segmental k-means procedure. On the first cycle, the training data is uniformly segmented, each model state is matched with the corresponding data segments and then means and variances are estimated. If mixture Gaussian models are being trained, then a modified form of k-means clustering is used. On the second and successive cycles, the uniform segmentation is replaced by Viterbi alignment. The initial parameter values computed by HInit are then further re-estimated by HRest. Again, the fully labelled bootstrap data is used, but this time the segmental k-means procedure is replaced by the Baum-Welch re-estimation procedure described in the previous chapter. When no bootstrap data is available, a so-called flat start can be used. In this case all of the phone models are initialised to be identical and have state means and variances equal to the global speech mean and variance. The tool HCompV can be used for this.

Once an initial set of models has been created, the tool HERest is used to perform embedded training using the entire training set. HERest performs a single Baum-Welch re-estimation of the whole set of HMM phone models simultaneously. For each training utterance, the corresponding phone models are concatenated and then the forward-backward algorithm is used to accumulate the statistics of state occupation, means, variances, etc., for each HMM in the sequence. When all of the training data has been processed, the accumulated statistics are used to compute re-estimates of the HMM parameters. HERest is the core HTK training tool. It is designed to process large databases, it has facilities for pruning to reduce computation and it can be run in parallel across a network of machines.



Recognition and analysis tools


HVite performs Viterbi-based speech recognition. HVite takes as input a network describing the allowable word sequences, a dictionary defining how each word is pronounced and a set of HMMs. It operates by converting the word network to a phone network and then attaching the appropriate HMM definition to each phone instance. Recognition can then be performed on either a list of stored speech files or on direct audio input. HVite can support cross-word triphones and it can run with multiple tokens to generate lattices containing multiple hypotheses. It can also be configured to rescore lattices and perform forced alignments.

HResults uses dynamic programming to align the reference and recognised transcriptions and then counts substitution, deletion and insertion errors. Options are provided to ensure that the algorithms and output formats used by HResults are compatible with those used by the US National Institute of Standards and Technology (NIST). As well as global performance measures, HResults can also provide speaker-by-speaker breakdowns, confusion matrices and time-aligned transcriptions. For word spotting applications, it can also compute Figure of Merit (FOM) scores and Receiver Operating Curve (ROC) information.



How to use HTK in 10 easy steps


Step 1 - Set the task

Prepare the grammar in BNF format:

  [.]     optional
  {.}     zero or more repetitions
  (.)     block
  <.>     loop
  <<.>>   context-dependent loop
  .|.     alternatives

Compile the grammar to lattice format:

  D:\htk-3.1\bin.win32\HParse location-grammar lg.lat

  $location = where is | how to find | how to come to;
  $ex = sorry | excuse me | pardon;
  $intro = can you tell me | do you know;
  $address = acton town | admirality arch | baker street | bond street | big ben | blackhorse road | buckingham palace | cambridge | canterbury | charing cross road | covent garden | downing street | ealing | edgware road | finchley road | gloucester road | greenwich | heathrow airport | high street | house of parliament | hyde park | kensington | king's cross | leicester square | marble arch | old street | paddington station | piccadilly circus | portobello market | regent's park | thames river | tower bridge | trafalgar square | victoria station | westminster abbey | whitehall | wimbledon | windsor;
  $end = please;

  (!ENTER {_SIL_} ({$ex} {into} {$location} $address {$end}) {_SIL_} !EXIT)


How to use HTK in 10 easy steps


Step 2 - Prepare the pronunciation dictionary

Find the list of words used in the task -> lg.wlist

Prepare the dictionary by hand, automatically, or using a standard pronunciation dictionary (e.g. BEEP for British English), or use the whole BEEP dictionary:

  where      [where]      1.0 w e@
  where      [where]      1.0 w e@ r
  is         [is]         1.0 I z
  how        [how]        1.0 h aU
  admirality [admirality] 1.0 { d m @ r @ l i: t i:
  palace     [palace]     1.0 p { l I s


How to use HTK in 10 easy steps


Step 3 - Record the Training and Test Data

HTK has a tool for prompted recordings, HSLab, but it works under Linux only; usually other programs are used for that.

First generate the prompts, then record them:

  D:\htk-3.1\bin.win32\HSGen -l -n 200 lg.lat beep.dic > lg.200

  1. how to come to baker street _SIL_ !EXIT
  2. ealing please _SIL_ !EXIT
  3. heathrow airport !EXIT
  4. leicester square _SIL_ !EXIT
  5. king's cross please _SIL_ !EXIT
  6. hyde park _SIL_ !EXIT
  7. _SIL_ greenwich please _SIL_ _SIL_ _SIL_ _SIL_ _SIL_ !EXIT
  8. old street !EXIT
  9. high street _SIL_ _SIL_ _SIL_ _SIL_ !EXIT
  10. whitehall !EXIT
  11. old street !EXIT
  12. canterbury please !EXIT
  13. into edgware road !EXIT
  14. whitehall _SIL_ !EXIT
  15. whitehall _SIL_ !EXIT
  16. finchley road please please please _SIL_ !EXIT

Record the prompts and store them in a chosen format: 16 kHz, 16-bit, headerless (?)
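What HSGen does here can be sketched in a few lines of Python: pick one alternative per grammar variable, including the optional parts at random. The grammar below is a truncated subset of the one in Step 1, and the 50/50 inclusion probability for optional parts is an assumption, not necessarily HSGen's actual sampling scheme:

```python
import random

# Truncated subset of the Step 1 grammar, for illustration only.
GRAMMAR = {
    "$ex": ["sorry", "excuse me", "pardon"],
    "$location": ["where is", "how to find", "how to come to"],
    "$address": ["acton town", "baker street", "heathrow airport",
                 "victoria station", "whitehall"],
    "$end": ["please"],
}

def gen_prompt(rng):
    """Assemble one prompt; {...} (optional) parts included with p=0.5."""
    parts = []
    for var, optional in [("$ex", True), ("$location", True),
                          ("$address", False), ("$end", True)]:
        if not optional or rng.random() < 0.5:
            parts.append(rng.choice(GRAMMAR[var]))
    return " ".join(parts)
```

Since $address is the only mandatory variable, every generated prompt contains exactly one address, just as in the HSGen output above.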





How to use HTK in 10 easy steps


Step 4 - Create the Transcription Files

In HTK all transcription files can be merged into one Master Label File (MLF).

Usually it is enough to have word-level transcripts. If a phone-level transcription is necessary, it can be generated automatically using HLEd.

  #!MLF!#
  "*/S0001.lab"
  how
  to
  come
  to
  baker
  street
  .
  "*/S0002.lab"
  ealing
  please
  .
  (etc...)
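Building such a word-level MLF from the HSGen prompt file of Step 3 is easy to script. A sketch in Python; the S0001.lab naming pattern is an assumption about how the recordings were named:

```python
def prompts_to_mlf(lines):
    """Turn HSGen prompt lines like
         '1. how to come to baker street _SIL_ !EXIT'
       into word-level MLF text (one word per line, '.' ends each entry)."""
    out = ["#!MLF!#"]
    for line in lines:
        num, _, text = line.partition(". ")
        # drop silence and the !ENTER/!EXIT network markers
        words = [w for w in text.split()
                 if w not in ("_SIL_", "!ENTER", "!EXIT")]
        out.append('"*/S%04d.lab"' % int(num))   # assumed file naming
        out.extend(words)
        out.append(".")
    return "\n".join(out) + "\n"
```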


How to use HTK in 10 easy steps



Step 5 - Parametrize the Data

Use HCopy to compute MFCC and delta parameters. Use a configuration file to set all the options (hcopy.conf):

  HCopy -T 1 -C hcopy.conf -S file.list

  ### hcopy.conf
  ### input file specific section
  SOURCEFORMAT = NOHEAD
  HEADERSIZE = 0
  # 16 kHz corresponds to a 0.0625 ms sample period
  SOURCERATE = 625
  ###
  ### analysis section
  ###
  # no DC offset correction
  ZMEANSOURCE = FALSE
  # no random noise added
  ADDDITHER = 0.0
  # preemphasis
  PREEMCOEF = 0.97
  # windowing
  TARGETRATE = 100000
  WINDOWSIZE = 250000
  USEHAMMING = TRUE
  # fbank analysis
  NUMCHANS = 24
  LOFREQ = 80
  HIFREQ = 7500
  # don't take the sqrt:
  USEPOWER = TRUE
  # cepstrum calculation
  NUMCEPS = 12
  CEPLIFTER = 22
  # energy
  ENORMALISE = FALSE
  ESCALE = 1.0
  RAWENERGY = FALSE
  # delta and delta-delta
  DELTAWINDOW = 2
  ACCWINDOW = 2
  SIMPLEDIFFS = FALSE
  ###
  ### output file specific section
  ###
  TARGETKIND = MFCC_D_A_0
  TARGETFORMAT = HTK
  SAVECOMPRESSED = TRUE
  SAVEWITHCRC = TRUE
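All HTK durations in this configuration are expressed in 100 ns units, which is easy to get wrong; a tiny Python helper confirms the values above:

```python
# HTK times are in 100 ns units; check the hcopy.conf values.
HTK_TICK = 100e-9  # seconds per HTK time unit

def to_hz(source_rate):
    """SOURCERATE (sample period in HTK units) -> sampling frequency."""
    return 1.0 / (source_rate * HTK_TICK)

def to_ms(htk_units):
    """WINDOWSIZE / TARGETRATE (HTK units) -> milliseconds."""
    return htk_units * HTK_TICK * 1e3

# SOURCERATE = 625   -> 16 kHz sampling
# WINDOWSIZE = 250000 -> 25 ms analysis window
# TARGETRATE = 100000 -> 10 ms frame shift
```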



How to use HTK in 10 easy steps

Step 6 - Create Monophone HMMs

Define a prototype model and clone it for all phones:

~o <VecSize> 39 <MFCC_D_A_0> <StreamInfo> 1 39
~h "p"
<BeginHMM>
<NumStates> 5
<State> 2 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 4 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<TransP> 5
0.000e+0 1.000e+0 0.000e+0 0.000e+0 0.000e+0
0.000e+0 6.000e-1 4.000e-1 0.000e+0 0.000e+0
0.000e+0 0.000e+0 6.000e-1 4.000e-1 0.000e+0
0.000e+0 0.000e+0 0.000e+0 6.000e-1 4.000e-1
0.000e+0 0.000e+0 0.000e+0 0.000e+0 0.000e+0
<EndHMM>
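The cloning step can also be scripted outside HTK: re-emit the prototype definition once per monophone, changing only the ~h name. A sketch with illustrative names (how the resulting file is fed back into training depends on your setup):

```python
def clone_proto(proto_body, phones):
    """proto_body: the prototype definition from <BeginHMM> to <EndHMM>.
       Returns one ~h-named copy of it per phone."""
    chunks = []
    for p in phones:
        chunks.append('~h "%s"\n%s' % (p, proto_body))
    return "\n".join(chunks) + "\n"
```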















How to use HTK in 10 easy steps


Step 7 - Initialize the models

Use HInit:

  HInit -S trainlist -H globals -M dir1 proto

Firstly, the Viterbi algorithm is used to find the most likely state sequence corresponding to each training example, then the HMM parameters are estimated. As a side-effect of finding the Viterbi state alignment, the log likelihood of the training data can be computed. Hence, the whole estimation process can be repeated until no further increase in likelihood is obtained.

If no bootstrap data is available, use HCompV for flat start initialization: it will scan a set of data files, compute the global mean and variance and set all of the Gaussians in a given HMM to have the same mean and variance.
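HInit's first, uniform-segmentation cycle can be illustrated with a toy one-dimensional sketch (real HTK works on 39-dimensional vectors and replaces the uniform split with Viterbi alignment on later cycles):

```python
import statistics

def uniform_init(examples, n_states):
    """Toy sketch of HInit's first cycle: split each training example
       uniformly across the emitting states, then estimate a mean and
       variance per state from the pooled frames.
       examples: list of 1-D feature sequences (lists of floats)."""
    buckets = [[] for _ in range(n_states)]
    for seq in examples:
        step = len(seq) / n_states
        for s in range(n_states):
            buckets[s].extend(seq[int(s * step):int((s + 1) * step)])
    return [(statistics.mean(b), statistics.pvariance(b)) for b in buckets]
```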



How to use HTK in 10 easy steps



Step 8 - Isolated Unit Re-Estimation using HRest

Its operation is very similar to HInit except that it expects the input HMM definition to have been initialised and it uses Baum-Welch re-estimation in place of Viterbi training.

Whereas Viterbi training makes a hard decision as to which state each training vector was "generated" by, Baum-Welch takes a soft decision. This can be helpful when estimating phone-based HMMs since there are no hard boundaries between phones in real speech and using a soft decision may give better results.

  HRest -S trainlist -H dir1/globals -M dir2 -l ih -L labs dir1/ih

This will load the HMM definition for /ih/ from dir1, re-estimate the parameters using the speech segments labelled with ih and write the new definition to directory dir2.


How to use HTK in 10 easy steps


Step 9 - Embedded Training using HERest

HERest embedded training simultaneously updates all of the HMMs in a system using all of the training data.

On startup, HERest loads in a complete set of HMM definitions. Every training file must have an associated label file which gives a transcription for that file. Only the sequence of labels is used by HERest, however, and any boundary location information is ignored. Thus, these transcriptions can be generated automatically from the known orthography of what was said and a pronunciation dictionary.

HERest processes each training file in turn. After loading it into memory, it uses the associated transcription to construct a composite HMM which spans the whole utterance. This composite HMM is made by concatenating instances of the phone HMMs corresponding to each label in the transcription. The Forward-Backward algorithm is then applied and the sums needed to form the weighted averages are accumulated in the normal way. When all of the training files have been processed, the new parameter estimates are formed from the weighted sums and the updated HMM set is output.

  HERest -t 120.0 60.0 240.0 -S trainlist -I labs \
         -H dir1/hmacs -M dir2 hmmlist

-t : pruning beam limits

HERest can also be used to prepare context-dependent models.
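The alpha recursion at the heart of the Forward-Backward algorithm can be shown on a toy discrete-output HMM (HERest of course uses continuous Gaussian emissions and also runs the backward pass to obtain state occupancies):

```python
def forward(A, B, pi, obs):
    """Forward pass for a discrete HMM.
       A[i][j]: transition probs, B[i][o]: emission probs,
       pi: initial state probs, obs: observation indices.
       Returns the total likelihood P(obs | model)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)
```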


How to use HTK in 10 easy steps


Step 10 - Use HVite to recognize utterances and HResults to evaluate the recognition rate

  D:\htk-3.1\bin.win32\HVite -g -w lg.lat -H wsjcam0.mmf -S test.list -C hvite.conf -i recresults.mlf beep.dic wsjcam0.mlist

A lot of other options can be set (beam width, scale factors, weights, etc.)

On-line (direct audio input):

  D:\htk-3.1\bin.win32\HVite -g -w lg.lat -H wsjcam0.mmf -C live.conf beep.dic wsjcam0.mlist

Statistics of results:

  HResults -I testref.mlf tiedlist recout.mlf

  ====================== HTK Results Analysis ==============
  Ref : testrefs.mlf
  Rec : recout.mlf
  ------------------------ Overall Results -----------------
  SENT: %Correct=98.50 [H=197, S=3, N=200]
  WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]
  ==========================================================

  N = total number, I = insertions, S = substitutions, D = deletions
  correct: H = N - S - D, %Corr = H/N, Acc = (H - I)/N
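The %Corr and Acc figures follow directly from the counts in the brackets; a quick check of the formulas in Python:

```python
def word_scores(h, d, s, i):
    """HResults-style word scoring from hit/deletion/substitution/
       insertion counts over the aligned transcriptions."""
    n = h + d + s                      # N = total reference words
    pct_corr = 100.0 * h / n           # %Corr = H/N
    pct_acc = 100.0 * (h - i) / n      # Acc = (H - I)/N
    return n, round(pct_corr, 2), round(pct_acc, 2)
```

Plugging in the counts from the output above (H=853, D=1, S=1, I=1) reproduces %Corr=99.77 and Acc=99.65 over N=855 words.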


Bye Bye


Thanks for your participation!