# A Survey of Boosting HMM Acoustic Model Training


Presented by: Fang-Hui Chu


Introduction

The No Free Lunch Theorem states that there is no single learning algorithm that in any domain always induces the most accurate learner.

Learning is an ill-posed problem, and with finite data each algorithm converges to a different solution and fails under different circumstances.

Even though the performance of a learner may be fine-tuned, there are still instances on which even the best learner is not accurate enough.

The idea: there may be another learner that is accurate on these instances, so by suitably combining multiple learners, accuracy can be improved.

Introduction

Since there is no point in combining learners that always make similar decisions, the aim is to find a set of base-learners that differ in their decisions so that they complement each other.

There are different ways the multiple base-learners can be combined to generate the final output:

Multiexpert combination methods
  Voting and its variants
  Mixture of experts
  Stacked generalization

Multistage combination methods
  Cascading

Voting

The simplest way to combine multiple classifiers is voting, which corresponds to taking a linear combination of the learners; this is also known as ensembles and linear opinion pools:

$$y = \sum_{j=1}^{L} w_j d_j, \qquad w_j \ge 0, \quad \sum_{j=1}^{L} w_j = 1$$

The name voting comes from its use in classification, where there are $K$ outputs:

$$y_i = \sum_{j=1}^{L} w_j d_{ji}, \qquad i = 1, \ldots, K$$

If $w_j = 1/L$, this is called plurality voting; if in addition $K = 2$, so that the winning class takes more than half of the votes, it is called majority voting.
Bagging

Bagging is a voting method whereby base-learners are made different by training them over slightly different training sets. This is done by bootstrap: given a training set X of size N, we draw N instances randomly from X with replacement.

In bagging, generating complementary base-learners is left to chance and to the instability of the learning method. A learning algorithm is an unstable algorithm if small changes in the training set cause a large difference in the generated learner; decision trees, multilayer perceptrons, and condensed nearest neighbor are examples.

Bagging is short for bootstrap aggregating.

Breiman, L. 1996. Bagging Predictors. Machine Learning 24, 123-140
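A minimal sketch of the bootstrap step and the bagging loop, assuming a hypothetical `train_learner` function that fits one base-learner on a list of instances:

```python
import random

def bootstrap_sample(training_set, rng):
    """Draw N instances randomly from a training set of size N, with replacement."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

def bagging(training_set, train_learner, n_learners=10, seed=0):
    """Train each base-learner on its own bootstrap replicate of the training set."""
    rng = random.Random(seed)
    return [train_learner(bootstrap_sample(training_set, rng))
            for _ in range(n_learners)]
```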

Boosting

In boosting, we actively try to generate complementary base-learners by training the next learner on the mistakes of the previous learners.

The original boosting algorithm (Schapire 1990) combines three weak learners to generate a strong learner, in the sense of the probably approximately correct (PAC) learning model.

Disadvantage: it requires a very large training sample.

Schapire, R.E. 1990. The Strength of Weak Learnability. Machine Learning 5, 197-227

[Figure: the original boosting scheme, with three training sets X1, X2, X3 and three weak learners d1, d2, d3]

AdaBoost

AdaBoost, short for adaptive boosting, uses the same training set over and over and thus need not be large, and it can combine an arbitrary number of base-learners, not just three.

The idea is to modify the probabilities of drawing the instances as a function of the error: the probability of a correctly classified instance is decreased, and a new sample set is then drawn from the original sample according to these modified probabilities. Training therefore focuses more on the instances misclassified by the previous learner.

Schapire et al. explain that the success of AdaBoost is due to its property of increasing the margin.

Schapire et al. 1998. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. Annals of Statistics 26, 1651-1686
Freund and Schapire. 1996. Experiments with a New Boosting Algorithm. In ICML 13, 148-156
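The resampling idea can be made concrete with a minimal AdaBoost.M1-style sketch (a simplification for illustration, not the exact AdaBoost.M2 algorithm on the next slide); `train_learner` and the classifier call interface are hypothetical:

```python
import random

def adaboost_round(samples, labels, probs, train_learner, rng=random.Random(0)):
    """One boosting round: resample, train, then update the draw probabilities.

    probs[i] is the current probability of drawing instance i (summing to 1);
    train_learner returns a callable classifier.
    """
    n = len(samples)
    # Draw a new sample set from the original sample according to the probabilities.
    idx = rng.choices(range(n), weights=probs, k=n)
    learner = train_learner([samples[i] for i in idx], [labels[i] for i in idx])

    # Weighted error of the new learner on the original sample.
    wrong = [learner(x) != y for x, y in zip(samples, labels)]
    eps = sum(p for p, e in zip(probs, wrong) if e)
    if eps == 0.0 or eps >= 0.5:          # stopping condition, as on the next slide
        return learner, probs, eps

    # Decrease the probability of correctly classified instances and renormalize,
    # so that the next learner focuses on the instances this one got wrong.
    beta = eps / (1.0 - eps)
    new_probs = [p * beta if not e else p for p, e in zip(probs, wrong)]
    z = sum(new_probs)
    return learner, [p / z for p in new_probs], eps
```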

AdaBoost.M2 (Freund and Schapire, 1997)

[Figure: the AdaBoost.M2 algorithm; training stops if a learner's error reaches εj ≥ 1/2]

Freund and Schapire. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, 119-139

Evolution of Boosting Algorithms

Timeline of boosting work on acoustic modeling, 1996-2006:

ICSLP 96: G. Cook & T. Robinson, "Boosting the Performance of Connectionist LVSR"
EuroSpeech 97: G. Cook et al., "Ensemble Methods for Connectionist Acoustic Modeling"
ICASSP 99: H. Schwenk, "Using Boosting to Improve a Hybrid HMM/Neural Network Speech Recognizer"
ICASSP 00: G. Zweig & M. Padmanabhan, "Boosting Gaussian Mixtures in An LVCSR System"
ICASSP 02: I. Zitouni et al., "Combination of Boosting and Discriminative Training for Natural Language Call Steering Systems"
ICASSP 02: C. Meyer, "Utterance-Level Boosting of HMM Speech Recognition"
EuroSpeech 03: R. Zhang & A. Rudnicky, "Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models"
ICASSP 03: R. Zhang & A. Rudnicky, "Improving the Performance of An LVCSR System Through Ensembles of Acoustic Models"
ICASSP 04: C. Dimitrakakis & S. Bengio, "Boosting HMMs with An Application to Speech Recognition"
ICSLP 04: R. Zhang & A. Rudnicky, "A Frame Level Boosting Training Scheme for Acoustic Modeling"
ICSLP 04: R. Zhang & A. Rudnicky, "Optimizing Boosting with Discriminative Criteria"
ICSLP 04: R. Zhang & A. Rudnicky, "Apply N-Best List Re-Ranking to Acoustic Model Combinations of Boosting Training"
EuroSpeech 05: R. Zhang et al., "Investigations on Ensemble Based Semi-Supervised Acoustic Model Training"
ICSLP 06: R. Zhang & A. Rudnicky, "Investigations of Issues for Using Multiple Acoustic Models to Improve CSR"
SpeechCom 06: C. Meyer & H. Schramm, "Boosting HMM Acoustic Models in LVCSR"

(The early work targeted connectionist/neural network and GMM systems; the later work boosts full HMM acoustic models.)

Presented by: Fang-Hui Chu

Improving The Performance of An LVCSR System Through Ensembles of Acoustic Models
ICASSP 2003
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Bagging vs. Boosting

Bagging
In each round, bagging randomly selects a number of examples from the original training set and produces a new single classifier based on the selected subset.
The final classifier is built by choosing the hypothesis best agreed on by the single classifiers.

Boosting
In boosting, the single classifiers are iteratively trained in such a fashion that hard-to-classify examples are given increasing emphasis.
A parameter that measures each classifier's importance is determined with respect to its classification accuracy.
The final hypothesis is the weighted majority vote of the single classifiers.

Algorithms

The first algorithm is based on the intuition that an incorrectly recognized utterance should receive more attention in training.
For example, if the weight of an utterance is 2.6, we first add two copies of the utterance to the new training set, and then add its third copy with probability 0.6.
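A minimal sketch of this weight-to-copies resampling rule (a weight of 2.6 gives two certain copies plus a third with probability 0.6); the parallel-list layout is an assumption for illustration:

```python
import math
import random

def resample_by_weight(utterances, weights, seed=0):
    """Duplicate each utterance according to its (possibly fractional) weight."""
    rng = random.Random(seed)
    new_training_set = []
    for utt, w in zip(utterances, weights):
        whole = int(math.floor(w))
        copies = whole + (1 if rng.random() < w - whole else 0)
        new_training_set.extend([utt] * copies)
    return new_training_set
```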

Algorithms

The exponential increase in the size of the training set is a severe problem for Algorithm 1.
Algorithm 2 is proposed to address this problem.

Algorithms

In Algorithms 1 and 2, there is no attempt to measure how important a model is relative to the others, yet a good model should play a more important role than a bad one.

To weight the models, an exponential loss is introduced, in which $c_t$ is the importance of model $t$ and $e_t(\mathbf{x}_i)$ indicates whether model $t$ misrecognizes utterance $\mathbf{x}_i$:

$$L = \sum_{i} \exp\left(\sum_{t=1}^{T+1} c_t\, e_t(\mathbf{x}_i)\right),
\qquad
e_t(\mathbf{x}_i) = \begin{cases} 1 & \text{if model } t \text{ misrecognizes } \mathbf{x}_i \\ 0 & \text{otherwise} \end{cases}$$

$$L = \sum_{i} \exp\left(\sum_{t=1}^{T} c_t\, e_t(\mathbf{x}_i)\right)\exp\big(c_{T+1}\, e_{T+1}(\mathbf{x}_i)\big)
 = \sum_{i} w(\mathbf{x}_i)\, \exp\big(c_{T+1}\, e_{T+1}(\mathbf{x}_i)\big)$$

where $w(\mathbf{x}_i) = \exp\big(\sum_{t=1}^{T} c_t\, e_t(\mathbf{x}_i)\big)$ serves as the weight of utterance $\mathbf{x}_i$ when training model $T+1$.
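Assuming the exponential-loss formulation reconstructed above, the utterance weights could be computed as in this sketch (the nested-list layout is hypothetical, not the paper's code):

```python
import math

def utterance_weights(error_indicators, model_weights):
    """w(x_i) = exp( sum_t c_t * e_t(x_i) ).

    error_indicators[t][i] is 1 if model t misrecognized utterance i, else 0;
    model_weights[t] is the importance c_t of model t.
    """
    n_utts = len(error_indicators[0])
    return [math.exp(sum(c * e[i] for c, e in zip(model_weights, error_indicators)))
            for i in range(n_utts)]
```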
Experiments

Corpus: CMU Communicator system

Experimental results:

Presented by: Fang-Hui Chu

Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models
EuroSpeech 2003
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute, CMU

Non-Boosting method

Bagging is a commonly used method in the machine learning field: it randomly selects a number of examples from the original training set and produces a new single classifier. In this paper it is called a non-Boosting method.

The Boosting methods, in contrast, are based on the intuition that a misrecognized utterance should receive more attention in the successive training.

Algorithms

λ is a parameter that prevents the size of the training set from growing too large.

Experiments

The corpus:
Training set: 31,248 utterances; test set: 1,689 utterances

Presented by: Fang-Hui Chu

A Frame Level Boosting Training Scheme for Acoustic Modeling
ICSLP 2004
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Introduction

In the current Boosting algorithm, the utterance is the basic unit used for acoustic model training.

Our analysis shows that there are two notable weaknesses in this setting:
First, the objective function of the current Boosting algorithm is designed to minimize utterance error instead of word error.
Second, in the current algorithm, an utterance is treated as a single unit for resampling.

This paper proposes a frame level Boosting training scheme for acoustic modeling to address these two problems.

Frame Level Boosting Training Scheme

The metric used in Boosting training is the frame level conditional probability of the (word level) label at frame t, computed over the N-best list:

$$P(a_t \mid \mathbf{x}) = \frac{\sum_{h \in \mathrm{NBest},\, h_t = a_t} P(h, \mathbf{x})}{\sum_{h \in \mathrm{NBest}} P(h, \mathbf{x})}$$

Objective function:

$$L = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \exp\Big( P(a_{i,t} \mid \mathbf{x}_i) - P(a^{\mathrm{label}}_{i,t} \mid \mathbf{x}_i) \Big)$$

where $\exp\big(P(a_{i,t} \mid \mathbf{x}_i) - P(a^{\mathrm{label}}_{i,t} \mid \mathbf{x}_i)\big)$ is the pseudo loss for frame $t$, which describes the degree of confusion of this frame for recognition.

Frame Level Boosting Training Scheme

Training scheme: how is the frame level training data resampled? Each frame $\mathbf{x}_{i,t}$ is duplicated a number of times determined by its weight, and the copies are used to create a new utterance for acoustic model training.

Experiments

Corpus: CMU Communicator system

Experimental results:

Presented by: Fang-Hui Chu

Boosting HMM Acoustic Models in Large Vocabulary Speech Recognition
Speech Communication 2006
Carsten Meyer, Hauke Schramm
Philips Research Laboratories, Germany

Utterance approach for boosting in ASR

An intuitive way of applying boosting to HMM speech recognition is at the utterance level; boosting is then used to improve upon an initial ranking of candidate word sequences.

The utterance approach has two advantages:
First, it is directly related to the sentence error rate.
Second, it is computationally much less expensive than boosting applied at the level of feature vectors.

Utterance approach for boosting in ASR

In the utterance approach, we define the input patterns $\mathbf{x}_i$ to be the sequence of feature vectors corresponding to the entire utterance $i$.

$y$ denotes one possible candidate word sequence of the speech recognizer, with $y_i$ being the correct word sequence for utterance $i$.

The a posteriori confidence measure is calculated on the basis of the N-best list $L_i$ for utterance $i$:

$$h_t(\mathbf{x}_i, y) = \frac{p_t(\mathbf{x}_i \mid y)\, p(y)}{\sum_{z \in L_i} p_t(\mathbf{x}_i \mid z)\, p(z)}$$
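A small sketch of this a posteriori confidence over an N-best list; the log-domain dictionaries are an illustrative assumption about how the recognizer scores are stored:

```python
import math

def nbest_confidence(log_acoustic, log_prior, target):
    """h_t(x_i, y): posterior confidence of candidate `target` over the N-best list L_i.

    log_acoustic[z] ~ log p_t(x_i | z) and log_prior[z] ~ log p(z) for every
    candidate word sequence z in the N-best list.
    """
    scores = {z: log_acoustic[z] + log_prior[z] for z in log_acoustic}
    m = max(scores.values())                      # log-sum-exp for numerical stability
    log_total = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return math.exp(scores[target] - log_total)
```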
Utterance approach for boosting in ASR

Based on the confidence values and the AdaBoost.M2 algorithm, we calculate an utterance weight $w_t(i)$ for each training utterance $i$.

Subsequently, the weights are used in maximum likelihood and discriminative training of the Gaussian mixture models:

$$F_{\mathrm{ML}} = \sum_{i=1}^{N} w_t(i)\, \log p(\mathbf{x}_i, y_i)$$

$$F_{\mathrm{MMI}} = \sum_{i=1}^{N} w_t(i)\, \log \frac{p(\mathbf{x}_i \mid y_i)\, p(y_i)}{\sum_{y} p(\mathbf{x}_i \mid y)\, p(y)}$$
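A sketch of how the boosting weights $w_t(i)$ enter these criteria, assuming per-utterance log scores have already been produced by some HMM toolkit (only the weighted accumulation is shown; the names are hypothetical):

```python
import math

def weighted_ml_objective(log_joint_correct, utt_weights):
    """F_ML = sum_i w_t(i) * log p(x_i, y_i)."""
    return sum(w * ll for w, ll in zip(utt_weights, log_joint_correct))

def weighted_mmi_objective(log_joint_correct, log_joint_hyps, utt_weights):
    """F_MMI = sum_i w_t(i) * log[ p(x_i, y_i) / sum_y p(x_i, y) ].

    log_joint_hyps[i] is a list of log p(x_i, y) over hypotheses y
    (e.g. from an N-best list or lattice, typically including y_i).
    """
    total = 0.0
    for w, num, dens in zip(utt_weights, log_joint_correct, log_joint_hyps):
        m = max(dens)
        log_denominator = m + math.log(sum(math.exp(d - m) for d in dens))
        total += w * (num - log_denominator)
    return total
```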

Utterance approach for boosting in ASR

Some problems are encountered when applying this approach to large-scale continuous speech applications: N-best lists of reasonable length (e.g. N = 100) generally contain only a tiny fraction of the possible classification results.

This has two consequences:
In training, it may lead to sub-optimal utterance weights.
In recognition, Eq. (1) cannot be applied appropriately.

$$H(\mathbf{x}) = \arg\max_{y \in Y} \sum_{t=1}^{T} \ln\!\Big(\frac{1}{\beta_t}\Big)\, h_t(\mathbf{x}, y)
\qquad \text{with} \quad
h_t(\mathbf{x}_i, y) = \frac{p_t(\mathbf{x}_i \mid y)\, p(y)}{\sum_{z \in L_i} p_t(\mathbf{x}_i \mid z)\, p(z)}
\qquad (1)$$
Utterance approach for CSR -- Training

Training

A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in "chopping" the training data. For long sentences, this simply means inserting additional sentence break symbols at silence intervals with a given minimum length (a minimal sketch follows below).

This reduces the number of possible classifications of each sentence "fragment", so that the resulting N-best lists should cover a sufficiently large fraction of the hypotheses.
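A minimal sketch of the chopping step, assuming a forced alignment that yields (label, duration) segments, a hypothetical silence label "sil", and an illustrative 0.3 s threshold (not the paper's value):

```python
def chop_utterance(segments, min_silence=0.3):
    """Split a long utterance into fragments at sufficiently long silence intervals."""
    fragments, current = [], []
    for label, duration in segments:
        if label == "sil" and duration >= min_silence and current:
            fragments.append(current)     # sentence break inserted here
            current = []
        else:
            current.append((label, duration))
    if current:
        fragments.append(current)
    return fragments
```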

Utterance approach for CSR -- Decoding

Decoding: lexical approach for model combination

A single pass decoding setup, where the combination of the boosted acoustic models is realized at a lexical level.

The basic idea is to add a new pronunciation model by "replicating" the set of phoneme symbols in each boosting iteration t, e.g. by appending the suffix "_t" to each phoneme symbol ("au", "au_1", "au_2", ...).

The new phoneme symbols represent the underlying acoustic model of boosting iteration t.
Utterance approach for CSR -- Decoding

Decoding: lexical approach for model combination (cont.)

Add to each phonetic transcription in the decoding lexicon a new transcription using the phoneme set of the corresponding model M_t (e.g. "sic_a", "sic_1 a_1", ...).

Use the reweighted training data to train the boosted classifier M_t.

Decoding is then performed using the extended lexicon and the set of acoustic models weighted by their unigram prior probabilities (a weighted summation), which are estimated on the training data.
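A sketch of the lexicon extension, assuming a simple dict-of-pronunciations representation; the suffixing follows the "_t" idea above, but the data layout is hypothetical:

```python
def extend_lexicon(lexicon, n_models):
    """For each boosting iteration t >= 1, add a copy of every pronunciation
    written with the suffixed phoneme set of model M_t ("au" -> "au_1", "au_2", ...).

    lexicon maps each word to a list of pronunciations (each a list of phonemes).
    """
    extended = {}
    for word, prons in lexicon.items():
        variants = [list(p) for p in prons]            # original phoneme set
        for t in range(1, n_models):
            variants += [[f"{ph}_{t}" for ph in p] for p in prons]
        extended[word] = variants
    return extended
```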

In more detail

[Figure: the boosted training loop. In boosting iteration t, the phoneme set is replicated with suffix "_t", the pronunciation lexicon is extended accordingly, the phonetically transcribed training corpus is reweighted with the utterance weights w_t(i), and a new acoustic model M_t is obtained by ML/MMI training; decoding then combines M_1, M_2, ..., M_t by unweighted or weighted model combination.]

Decoding with the extended lexicon amounts to

$$\hat{w}_1^N = \arg\max_{w_1^N} p(w_1^N \mid \mathbf{x})
= \arg\max_{w_1^N} p(w_1^N) \prod_{i=1}^{N} \sum_{v_i \in R(w_i)} p(v_i \mid w_i)\, p(\mathbf{x}_i \mid v_i)$$

where $R(w_i)$ is the set of pronunciation variants of word $w_i$ (including the replicated variants of each boosting iteration) and $p(v_i \mid w_i)$ are the pronunciation prior probabilities.

In more detail

Weighted model combination

$$p(\mathbf{x}_i \mid v_i) = \sum_{t=1}^{T} p(t)\, p(\mathbf{x}_i \mid v_i, M_t),
\qquad p(t) \propto \ln\frac{1}{\beta_t} \ \text{ for simplicity}$$

so that

$$\hat{w}_1^N = \arg\max_{w_1^N} p(w_1^N) \prod_{i=1}^{N} \sum_{v_i \in R(w_i)} \sum_{t=1}^{T} p(t)\, p(v_i \mid w_i)\, p(\mathbf{x}_i \mid v_i, M_t)$$

Word level model combination

Experiments

Isolated word recognition
Telephone-bandwidth large vocabulary isolated word recognition
SpeechDat(II) German material

Continuous speech recognition
Professional dictation and Switchboard

Isolated word recognition

Database:
Training corpus: 18k utterances (4.3 h) of city, company, first and family names

Evaluations:
LILI test corpus: 10k single word utterances (3.5 h); 10k word lexicon (matched conditions)
Names corpus: an in-house collection of 676 utterances (0.5 h); two different decoding lexica: 10k lex, 190k lex (acoustic conditions are matched, whereas there is a lexical mismatch)
Office corpus: 3.2k utterances (1.5 h), recorded over microphone in clean conditions; 20k lexicon (an acoustic mismatch to the training conditions)

Isolated word recognition

Boosting ML models

Isolated word recognition

Combining boosting and discriminative training
The experiments in isolated word recognition showed that boosting may improve the best test error rates.

Continuous speech recognition

Database

Professional dictation
An in-house data collection of real-life recordings of medical reports.
The acoustic training corpus consists of about 58 h of data.
Evaluations were carried out on two test corpora: the development corpus consists of 5.0 h of speech, and the evaluation corpus consists of 3.3 h of speech.

Switchboard
Spontaneous conversations recorded over telephone lines; 57 h (73 h) of male (female) speech.
Evaluation corpus: about 1 h (0.5 h) of male (female) speech.

Continuous speech recognition

Professional dictation:

Switchboard:

Conclusions

In this paper, a boosting approach that can be applied to any HMM based speech recognizer was presented and evaluated.

The increased recognizer complexity, and thus decoding effort, of the boosted systems is a major drawback compared to other training techniques such as discriminative training.

Probably Approximately Correct Learning

We would like our hypothesis to be approximately correct, namely, that the error probability be bounded by some value ε.

We would also like to be confident in our hypothesis: we want to know that our hypothesis will be correct most of the time, so we want to be probably correct as well.

Given a class C and examples drawn from some unknown but fixed probability distribution p(x), we want to find a hypothesis h such that, with probability at least 1 - δ, h has error at most ε, for arbitrary ε > 0 and δ ≤ 1/2:

$$P\{\,C \,\Delta\, h \le \varepsilon\,\} \ge 1 - \delta$$

Probably Approximately Correct Learning

How many training examples N should we have, such that with probability at least 1 - δ, h has error at most ε?

The most general hypothesis is G, the most specific hypothesis is S; every h in H between S and G is consistent with the examples, and together they make up the version space.

For the axis-aligned rectangle class, the error region between the true concept C and the tightest hypothesis S is covered by four strips, one along each side, so it suffices that each strip has probability mass at most ε/4:

Each strip is at most ε/4.
Probability that a random instance misses one strip: 1 - ε/4.
Probability that N instances all miss that strip: (1 - ε/4)^N.
Probability that N instances miss any of the 4 strips: at most 4(1 - ε/4)^N.
Requiring 4(1 - ε/4)^N ≤ δ and using (1 - x) ≤ exp(-x):
4 exp(-εN/4) ≤ δ, hence N ≥ (4/ε) ln(4/δ).
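For concreteness, plugging illustrative values into the bound (these numbers are an example, not from the slides):

$$\varepsilon = 0.1,\ \delta = 0.05:\qquad N \ \ge\ \frac{4}{0.1}\,\ln\frac{4}{0.05} \ =\ 40\,\ln 80 \ \approx\ 175.3,$$

so about 176 training examples suffice in this case.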