A Survey of Boosting HMM Acoustic Model Training


Presented by: Fang-Hui Chu

A Survey of Boosting HMM Acoustic Model Training

2

Introduction


The No Free Lunch Theorem states that:

There is no single learning algorithm that in any domain always induces the most accurate learner.

Learning is an ill-posed problem, and with finite data each algorithm converges to a different solution and fails under different circumstances.

Though the performance of a learner may be fine-tuned, there are still instances on which even the best learner is not accurate enough.

The idea is:

There may be another learner that is accurate on these instances.

By suitably combining multiple learners, accuracy can be improved.

3

Introduction


Since there is no point in combining learners that always make similar decisions, the aim is to find a set of base-learners that differ in their decisions, so that they complement each other.

There are different ways the multiple base-learners can be combined to generate the final output:

Multiexpert combination methods:
Voting and its variants
Mixture of experts
Stacked generalization

Multistage combination methods:
Cascading


4

Voting


The simplest way to combine multiple classifiers is to take a linear combination of the learners; this is also known as an ensemble or a linear opinion pool:

$$y = \sum_{j=1}^{L} w_j d_j, \qquad w_j \ge 0, \quad \sum_{j=1}^{L} w_j = 1$$

The name voting comes from its use in classification, where there are $K$ outputs:

$$y_i = \sum_{j=1}^{L} w_j d_{ji}, \qquad i = 1, \ldots, K$$

If $w_j = 1/L$, the class receiving the most votes wins; this is called plurality voting.

If $K = 2$ and the winning class must collect more than half of the votes, this is called majority voting.
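A minimal sketch of this voting rule, assuming each base-learner returns a vector of class supports (one-hot votes in the simplest case); all names here are illustrative:

```python
# Weighted voting over K classes from L base-learners.
import numpy as np

def weighted_vote(d, w=None):
    """d: array of shape (L, K) with per-learner class supports d_ji.
    w: optional weights w_j (w_j >= 0, summing to 1); defaults to 1/L (plurality voting).
    Returns the combined supports y_i and the index of the winning class."""
    d = np.asarray(d, dtype=float)
    L, K = d.shape
    w = np.full(L, 1.0 / L) if w is None else np.asarray(w, dtype=float)
    y = w @ d                      # y_i = sum_j w_j * d_ji
    return y, int(np.argmax(y))

# Example: three learners, two classes, equal weights (majority voting).
supports = [[1, 0], [0, 1], [1, 0]]   # hypothetical one-hot votes
print(weighted_vote(supports))         # class 0 wins with 2/3 of the votes
```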


5

Bagging


Bagging is a voting method whereby base-learners are made different by training them over slightly different training sets.

This is done by bootstrap: given a training set X of size N, we draw N instances randomly from X with replacement.

In bagging, generating complementary base-learners is left to chance and to the instability of the learning method.

A learning algorithm is an unstable algorithm if small changes in the training set cause a large difference in the generated learner, e.g. decision trees, multilayer perceptrons, condensed nearest neighbor.

Bagging is short for bootstrap aggregating.

Breiman, L. 1996. Bagging Predictors. Machine Learning 26, 123-140.
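A minimal sketch of the bootstrap step, assuming a generic train() routine for the base-learner (hypothetical):

```python
# Bagging: train each base-learner on its own bootstrap replicate of the data.
import random

def bootstrap_sample(X):
    """Draw len(X) instances from X uniformly at random, with replacement."""
    return [random.choice(X) for _ in range(len(X))]

def bagging(X, train, n_learners=10):
    """Return n_learners base-learners, each trained on a bootstrap replicate."""
    return [train(bootstrap_sample(X)) for _ in range(n_learners)]
```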

6

Boosting


In boosting, we actively try to generate complementary base-learners by training the next learner on the mistakes of the previous learners.

The original boosting algorithm (Schapire 1990) combines three weak learners to generate a strong learner, in the sense of the probably approximately correct (PAC) learning model.

Disadvantage:

It requires a very large training sample.

Schapire, R.E. 1990. The Strength of Weak Learnability. Machine Learning 5, 197-227.

[Figure: the three training sets X1, X2, X3 and the corresponding weak learners d1, d2, d3]

7

AdaBoost


AdaBoost, short for adaptive boosting, uses the same training set over and over and thus need not be large, and it can also combine an arbitrary number of base-learners, not three.

The idea is to modify the probabilities of drawing the instances as a function of the error.

The probability of a correctly classified instance is decreased, and a new sample set is drawn from the original sample according to these modified probabilities.

This focuses more on instances misclassified by the previous learner.

Schapire et al. explain that the success of AdaBoost is due to its property of increasing the margin.

Schapire et al. 1998. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. Annals of Statistics 26, 1651-1686.

Freund and Schapire. 1996. Experiments with a New Boosting Algorithm. In ICML 13, 148-156.

8


AdaBoost.M2 (Freund and Schapire, 1997)

[Algorithm: AdaBoost.M2 pseudocode; training stops if the weighted error of the current weak learner reaches ε_j ≥ 1/2.]

Freund and Schapire. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, 119-139.
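For illustration, a minimal sketch of the basic two-class AdaBoost loop (AdaBoost.M1 rather than AdaBoost.M2, for brevity); weak_learn is a hypothetical routine returning a classifier with a predict() method:

```python
# Two-class AdaBoost (AdaBoost.M1) with labels +1/-1.
import numpy as np

def adaboost(X, y, weak_learn, T=10):
    y = np.asarray(y)
    N = len(y)
    w = np.full(N, 1.0 / N)                  # instance weights
    learners, alphas = [], []
    for _ in range(T):
        h = weak_learn(X, y, w)
        pred = np.asarray(h.predict(X))
        err = np.sum(w[pred != y])           # weighted training error
        if err >= 0.5:                       # stop if no better than chance
            break
        beta = err / (1.0 - err)
        alpha = np.log(1.0 / beta)           # model weight ln(1/beta)
        w *= np.where(pred == y, beta, 1.0)  # decrease weight of correct instances
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # Weighted majority vote of the base-learners.
    votes = sum(a * np.asarray(h.predict(X)) for h, a in zip(learners, alphas))
    return np.sign(votes)
```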

9

Evolution of Boosting Algorithms

Neural network acoustic models:
1996: ICSLP 96, G. Cook & T. Robinson, “Boosting the Performance of Connectionist LVSR”
1997: EuroSpeech 97, G. Cook et al., “Ensemble Methods for Connectionist Acoustic Modeling”
1999: ICASSP 99, H. Schwenk, “Using Boosting to Improve a Hybrid HMM/Neural Network Speech Recognizer”

GMM acoustic models:
2000: ICASSP 00, G. Zweig & M. Padmanabhan, “Boosting Gaussian Mixtures in An LVCSR System”

HMM acoustic models:
2002: ICASSP 02, I. Zitouni et al., “Combination of Boosting and Discriminative Training for Natural Language Call Steering Systems”
2002: ICASSP 02, C. Meyer, “Utterance-Level Boosting of HMM Speech Recognition”
2003: EuroSpeech 03, R. Zhang & A. Rudnicky, “Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models”
2003: ICASSP 03, R. Zhang & A. Rudnicky, “Improving the Performance of An LVCSR System Through Ensembles of Acoustic Models”
2004: ICASSP 04, C. Dimitrakakis & S. Bengio, “Boosting HMMs with An Application to Speech Recognition”
2004: ICSLP 04, R. Zhang & A. Rudnicky, “A Frame Level Boosting Training Scheme for Acoustic Modeling”
2004: ICSLP 04, R. Zhang & A. Rudnicky, “Optimizing Boosting with Discriminative Criteria”
2004: ICSLP 04, R. Zhang & A. Rudnicky, “Apply N-Best List Re-Ranking to Acoustic Model Combinations of Boosting Training”
2005: EuroSpeech 05, R. Zhang et al., “Investigations on Ensemble Based Semi-Supervised Acoustic Model Training”
2006: ICSLP 06, R. Zhang & A. Rudnicky, “Investigations of Issues for Using Multiple Acoustic Models to Improve CSR”
2006: SpeechCom 06, C. Meyer & H. Schramm, “Boosting HMM Acoustic Models in LVCSR”

Presented by: Fang-Hui Chu

Improving The Performance of An LVCSR
System Through Ensembles of Acoustic Models

ICASSP 2003

Rong Zhang and Alexander I. Rudnicky

Language Technologies Institute,

School of Computer Science

Carnegie Mellon University

11

Bagging vs. Boosting


Bagging


In each round, bagging randomly selects a number of examples from
the original training set, and produces a new single classifier based on
the selected subset


The final classifier is built by choosing the hypothesis best agreed on by
single classifiers



Boosting


In boosting, the single classifiers are iteratively trained in such a fashion that hard-to-classify examples are given increasing emphasis.

A parameter that measures each classifier's importance is determined with respect to its classification accuracy.

The final hypothesis is the weighted majority vote of the single classifiers.

12

Algorithms


The first algorithm is based on the intuition that an incorrectly
recognized utterance should receive more attention in training











If the weight of an utterance is 2.6, we first add two copies of the utterance
to the new training set, and then add its third copy with probability 0.6
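A minimal sketch of this weight-based resampling rule (illustrative names):

```python
# Weight-based utterance resampling: an utterance with weight 2.6 contributes
# two copies, plus a third copy with probability 0.6.
import math
import random

def resample_by_weight(utterances, weights):
    new_set = []
    for utt, w in zip(utterances, weights):
        copies = int(math.floor(w))
        if random.random() < w - copies:   # fractional part used as a probability
            copies += 1
        new_set.extend([utt] * copies)
    return new_set
```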

13

Algorithms


The exponential increase in the size of the training set is a severe problem for algorithm 1.



Algorithm 2 is proposed
to address this problem

14

Algorithms


In algorithms 1 and 2, there is no attempt to measure how important a model is relative to the others.

A good model should play a more important role than a bad one.
















$$L = \sum_{i=1}^{N} \exp\!\left(\sum_{t=1}^{T} c_t\, e_t(\mathbf{x}_i)\right),
\qquad
e_t(\mathbf{x}_i) = \begin{cases} 1 & \text{if model } t \text{ misrecognizes } \mathbf{x}_i \\ 0 & \text{otherwise} \end{cases}$$

$$L = \sum_{i=1}^{N} \exp\!\left(\sum_{t=1}^{T-1} c_t\, e_t(\mathbf{x}_i)\right) \exp\!\left(c_T\, e_T(\mathbf{x}_i)\right)
   = \sum_{i=1}^{N} w_T(\mathbf{x}_i)\, \exp\!\left(c_T\, e_T(\mathbf{x}_i)\right)$$

where $c_t$ is the weight (importance) of model $t$ and $w_T(\mathbf{x}_i) = \exp\!\left(\sum_{t=1}^{T-1} c_t\, e_t(\mathbf{x}_i)\right)$ is the accumulated weight of utterance $\mathbf{x}_i$ entering round $T$.
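A minimal sketch of how the accumulated utterance weights w_T(x_i) could be computed from the per-model error indicators and model weights (illustrative names):

```python
# Accumulate utterance weights w_T(x_i) = exp(sum_t c_t * e_t(x_i)) from error
# indicators e[t][i] (1 if model t misrecognizes utterance i, else 0) and model
# importance weights c[t].
import numpy as np

def utterance_weights(e, c):
    """e: array (T-1, N) of error indicators for the models trained so far.
    c: array (T-1,) of model weights. Returns w_T(x_i) for each utterance."""
    e = np.asarray(e, dtype=float)
    c = np.asarray(c, dtype=float)
    return np.exp(c @ e)            # shape (N,)
```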
15

Experiments


Corpus : CMU Communicator system


Experimental results :

Presented by: Fang-Hui Chu

Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models

Rong Zhang and Alexander I. Rudnicky

Language Technologies Institute, CMU


EuroSpeech 2003

17

Non-Boosting method


Bagging

is a commonly used method in the machine learning field

randomly selects a number of examples from the original training set and produces a new single classifier

in this paper, we call it a non-Boosting method

Based on the intuition

A misrecognized utterance should receive more attention in the successive training.

18

Algorithms

λ is a parameter that prevents the size of
the training set from being too large.


19

Experiments


The corpus:


Training set: 31248 utterances; Test set: 1689 utterances

Presented by: Fang-Hui Chu

A Frame Level Boosting Training Scheme for
Acoustic Modeling

ICSLP 2004

Rong Zhang and Alexander I. Rudnicky

Language Technologies Institute,

School of Computer Science

Carnegie Mellon University

21

Introduction


In the current Boosting algorithm, the utterance is the basic unit used for acoustic model training.

Our analysis shows that there are two notable weaknesses in this setting:

First, the objective function of the current Boosting algorithm is designed to minimize utterance error instead of word error.

Second, in the current algorithm, an utterance is treated as a single unit for resampling.

This paper proposes a frame level Boosting training scheme for acoustic modeling to address these two problems.

22


Frame Level Boosting Training Scheme

The metric used in Boosting training is the frame-level conditional probability (word-level posterior) of the word $a$ occupying frame $t$:

$$P(a_t \mid \mathbf{x}_i) = \frac{\sum_{h \in \mathrm{NBest}(\mathbf{x}_i),\; a_t \in h} P(\mathbf{x}_i, h)}{\sum_{h \in \mathrm{NBest}(\mathbf{x}_i)} P(\mathbf{x}_i, h)}$$

Objective function:

$$L = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \exp\!\left( P(\tilde{a}_t \mid \mathbf{x}_i) - P\!\left(a_t^{\mathrm{label}} \mid \mathbf{x}_i\right) \right)$$

where $a_t^{\mathrm{label}}$ is the word of the reference transcription occupying frame $t$ of utterance $i$, $\tilde{a}_t$ is the competing hypothesized word for that frame, and $\exp\!\left( P(\tilde{a}_t \mid \mathbf{x}_i) - P(a_t^{\mathrm{label}} \mid \mathbf{x}_i) \right)$ is the pseudo loss for frame $t$, which describes the degree of confusion of this frame for recognition.
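A minimal sketch of the frame-level word posterior computed from an N-best list; the hypothesis representation is an assumption made for illustration:

```python
# Frame-level word posterior: the posterior mass of N-best hypotheses whose word
# at frame t matches word a, normalized over the whole N-best list.
import math

def frame_word_posterior(nbest, t, a):
    """nbest: list of (log_joint_prob, frame_to_word) pairs, where frame_to_word
    maps frame index -> word occupying that frame in the hypothesis."""
    # exponentiate log scores; a real system would normalize in the log domain
    total = sum(math.exp(lp) for lp, _ in nbest)
    matched = sum(math.exp(lp) for lp, fw in nbest if fw.get(t) == a)
    return matched / total if total > 0 else 0.0
```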
23

Frame Level Boosting Training Scheme


Training Scheme:


How to resample the frame-level training data?

Duplicate each frame $\mathbf{x}_{i,t}$ the required number of times and concatenate the copies into a new utterance for acoustic model training.
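A minimal sketch of the frame duplication step, assuming the per-frame copy counts have already been derived from the frame weights:

```python
# Frame-level resampling: duplicate each frame x_{i,t} according to a per-frame
# copy count and concatenate the copies into a new "utterance" (feature matrix).
import numpy as np

def resample_frames(features, copies):
    """features: array (T, D) of frame feature vectors for one utterance.
    copies: length-T sequence of non-negative integers (copies per frame)."""
    return np.repeat(np.asarray(features), np.asarray(copies, dtype=int), axis=0)
```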
24

Experiments


Corpus : CMU Communicator system


Experimental results :







Presented by: Fang-Hui Chu

Boosting HMM acoustic models in large
vocabulary speech recognition

Carsten Meyer, Hauke Schramm

Philips Research Laboratories, Germany


SPEECH COMMUNICATION 2006

26

Utterance approach for boosting in ASR


An intuitive way of applying boosting to HMM speech
recognition is at the utterance level


Thus, boosting is used to improve upon an initial ranking of candidate
word sequences



The utterance approach has two advantages:


First, it is directly related to the sentence error rate


Second, it is computationally much less expensive than boosting applied
at the level of feature vectors

27

Utterance approach for boosting in ASR


In the utterance approach, we define the input pattern $\mathbf{x}_i$ to be the sequence of feature vectors corresponding to the entire utterance $i$.

$y$ denotes one possible candidate word sequence of the speech recognizer, $y_i$ being the correct word sequence for utterance $i$.

The a posteriori confidence measure is calculated on the basis of the N-best list $L_i$ for utterance $i$:

$$h_t(\mathbf{x}_i, y) = \frac{p(y)\, p_t(\mathbf{x}_i \mid y)}{\sum_{z \in L_i} p(z)\, p_t(\mathbf{x}_i \mid z)}$$
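A minimal sketch of this N-best a posteriori confidence, assuming the candidates' joint scores are available as log-probabilities (illustrative names):

```python
# N-best a posteriori confidence: the joint score of candidate y normalized over
# the N-best list L_i. Scores are log p(y) + log p_t(x_i | y).
import math

def nbest_confidence(nbest_logscores, y):
    """nbest_logscores: dict mapping candidate word sequence -> log joint score."""
    m = max(nbest_logscores.values())                      # for numerical stability
    denom = sum(math.exp(s - m) for s in nbest_logscores.values())
    return math.exp(nbest_logscores[y] - m) / denom if y in nbest_logscores else 0.0
```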
28

Utterance approach for boosting in ASR


Based on the confidence values and the AdaBoost.M2 algorithm, we calculate an utterance weight $w_t(i)$ for each training utterance $i$.

Subsequently, the weights are used in maximum likelihood and discriminative training of the Gaussian mixture models:

$$F_{ML} = \sum_{i=1}^{N} w_t(i)\, \log p_t(\mathbf{x}_i, y_i)$$

$$F_{MMI} = \sum_{i=1}^{N} w_t(i)\, \log \frac{p_t(\mathbf{x}_i, y_i)}{\sum_{y} p(y)\, p_t(\mathbf{x}_i \mid y)}$$
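A minimal sketch of the two weighted criteria written as plain sums over utterances, given per-utterance log-likelihood terms (illustrative; a real trainer would optimize the model parameters against these objectives):

```python
# Weighted ML and MMI criteria as plain sums. logp_joint[i] = log p_t(x_i, y_i);
# logp_denom[i] = log sum_y p(y) p_t(x_i | y); w[i] = utterance weight.
def weighted_ml(w, logp_joint):
    return sum(wi * lp for wi, lp in zip(w, logp_joint))

def weighted_mmi(w, logp_joint, logp_denom):
    return sum(wi * (lp - ld) for wi, lp, ld in zip(w, logp_joint, logp_denom))
```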



29

Utterance approach for boosting in ASR


Some problems are encountered when applying this approach to large-scale continuous speech applications:

N-best lists of reasonable length (e.g. N = 100) generally contain only a tiny fraction of the possible classification results.

This has two consequences:

In training, it may lead to sub-optimal utterance weights.

In recognition, the combination rule of Eq. (1) cannot be applied appropriately:

$$H(\mathbf{x}) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left(\ln\frac{1}{\beta_t}\right) h_t(\mathbf{x}, y) \qquad (1)$$

with the confidence $h_t(\mathbf{x}, y)$ defined on the N-best list as above.
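A minimal sketch of the combination rule of Eq. (1) over the candidates of the N-best lists (illustrative names):

```python
# AdaBoost.M2-style combination: each boosted model t contributes its confidence
# h_t(x, y) scaled by ln(1/beta_t); the candidate with the largest total wins.
import math

def combine_hypotheses(confidences, betas):
    """confidences: list over models t of dicts {candidate y: h_t(x, y)}.
    betas: list of the boosting beta_t values."""
    candidates = set().union(*(c.keys() for c in confidences))
    def score(y):
        return sum(math.log(1.0 / b) * c.get(y, 0.0)
                   for c, b in zip(confidences, betas))
    return max(candidates, key=score)
```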













30

Utterance approach for CSR -- Training


Training


A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in “chopping” the training data.

For long sentences, this simply means inserting additional sentence break symbols at silence intervals with a given minimum length.

This reduces the number of possible classifications of each sentence “fragment”, so that the resulting N-best lists should cover a sufficiently large fraction of hypotheses.
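A minimal sketch of the chopping step, assuming silence intervals have already been located (e.g. by a forced alignment); the minimum-length threshold and names are illustrative:

```python
# "Chopping": insert sentence breaks at silence intervals of at least
# min_sil_frames, splitting one long utterance into shorter fragments.
def chop_utterance(num_frames, silences, min_sil_frames=30):
    """silences: sorted list of (start_frame, end_frame) silence intervals."""
    breaks = [0]
    for start, end in silences:
        if end - start >= min_sil_frames:
            breaks.append((start + end) // 2)   # break in the middle of the silence
    breaks.append(num_frames)
    return [(b0, b1) for b0, b1 in zip(breaks, breaks[1:]) if b1 > b0]
```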


31

Utterance approach for CSR -- Decoding


Decoding: lexical approach for model combination

A single pass decoding setup, where the combination of the boosted acoustic models is realized at a lexical level.

The basic idea is to add a new pronunciation model by “replicating” the set of phoneme symbols in each boosting iteration (e.g. by appending the suffix “_t” to the phoneme symbols).

The new phoneme symbols represent the underlying acoustic model of boosting iteration t, e.g. “au”, “au_1”, “au_2”, …
32

Utterance approach for CSR -- Decoding


Decoding: lexical approach for model combination (cont.)

Add to each phonetic transcription in the decoding lexicon a new transcription using the corresponding phoneme set (e.g. “sic_a”, “sic_1 a_1”, …).

Use the reweighted training data to train the boosted classifier M_t.

Decoding is then performed using the extended lexicon and the set of acoustic models M_t, weighted by their unigram prior probabilities, which are estimated on the training data (weighted summation).
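A minimal sketch of the lexicon extension, with the lexicon represented as a word-to-pronunciations mapping (an assumption made for illustration):

```python
# Lexical approach: for boosting iteration t, add to each word a new pronunciation
# whose phoneme symbols carry the suffix "_t", so that this variant selects the
# acoustic model of iteration t.
def extend_lexicon(lexicon, t):
    """lexicon: dict word -> list of pronunciations (each a list of phonemes)."""
    extended = {}
    for word, prons in lexicon.items():
        new_prons = [[f"{ph}_{t}" for ph in pron] for pron in prons]
        extended[word] = prons + new_prons
    return extended

# e.g. {"hello": [["h", "au"]]} gains the variant ["h_1", "au_1"] for iteration 1
```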

33

In more detail

[Diagram: in boosting iteration t, the phonetically transcribed training corpus is reweighted with the utterance weights w_t(i) and used for ML/MMI training of acoustic model M_t; the pronunciation lexicon is extended with “_t”-suffixed variants (e.g. “sic_a”, “sic_1 a_1”, …); decoding then combines M_1, M_2, …, M_t by unweighted or weighted model combination.]

Decoding with pronunciation variants:

$$\hat{w}_1^N = \arg\max_{w_1^N} p\!\left(w_1^N \mid \mathbf{x}\right)
= \arg\max_{w_1^N}\; p\!\left(w_1^N\right) \max_{v_1^N \in R(w_1^N)} \prod_{i=1}^{N} p(v_i \mid w_i)\; p\!\left(\mathbf{x} \mid v_1^N\right)$$

where $w_1^N$ is a word sequence, $R(w_1^N)$ the set of its pronunciation-variant sequences, and $p(v_i \mid w_i)$ the pronunciation prior of variant $v_i$ for word $w_i$.

34

In more detail

35

Weighted model combination





















$$\hat{w}_1^N = \arg\max_{w_1^N}\; p\!\left(w_1^N\right) \max_{v_1^N \in R(w_1^N),\; t_1^N} \prod_{i=1}^{N} p(t_i)\, p(v_i \mid w_i)\; p\!\left(\mathbf{x} \mid v_1^N, t_1^N\right),
\qquad p(t) \propto \ln\frac{1}{\beta_t} \;\text{ for simplicity}$$

Here $t_i$ indexes the boosted acoustic model chosen for word $w_i$ (via its “_t”-tagged pronunciation variant), so the boosting model weights enter the search as priors on the replicated variants.

Word level model combination

36

Experiments


Isolated word recognition


Telephone-bandwidth large vocabulary isolated word recognition

SpeechDat(II) German material



Continuous speech recognition


Professional dictation and Switchboard


37

Isolated word recognition


Database:


Training corpus: consists of 18k utterances (4.3h) of city, company, first
and family names


Evaluations:


LILI

test corpus: 10k single word utterances (3.5h); 10k words lexicon;
(matched conditions)


Names

corpus: an inhouse collection of 676 utterances (0.5h); two different
decoding lexica: 10k lex, 190k lex; (acoustic conditions are matched,
whereas there is a lexical mismatch)


Office

corpus: 3.2k utterances (1.5h), recorded over microphone in clean
conditions; 20k lexicon; (an acoustic mismatch to the training conditions)


38

Isolated word recognition


Boosting ML models


39

Isolated word recognition


Combining boosting and discriminative training










The experiments in isolated word recognition showed that boosting may
improve the best test error rates

40

Continuous speech recognition


Database


Professional dictation


An inhouse data collection of real-life recordings of medical reports


The acoustic training corpus consists of about 58h of data


Evaluations were carried out on two test corpora:


Development corpus consists of 5.0h of speech


Evaluation corpus consists of 3.3h of speech


Switchboard


Consisting of spontaneous conversations recorded over telephone lines; 57h (73h) of male (female) data

Evaluation corpus:

Containing about 1h (0.5h) of male (female) data

41

Continuous speech recognition


Professional dictation:


42


Switchboard:

43

Conclusions


In this paper, a boosting approach that can be applied to any HMM-based speech recognizer was presented and evaluated.



The increased recognizer complexity and thus decoding effort
of the boosted systems is a major drawback compared to
other training techniques like discriminative training


44

Probably Approximately Correct Learning


We would like our hypothesis to be approximately correct,
namely, that
the error probability be bounded by some value



We also would like to be confident in our hypothesis in that we
want to know that our hypothesis will be correct most of the
time, so we want to be probably correct as well



Given a class $C$ and examples drawn from some unknown but fixed probability distribution $p(\mathbf{x})$, we want that, with probability at least $1 - \delta$, the hypothesis $h$ has error at most $\epsilon$, for arbitrary $\delta \le 1/2$ and $\epsilon > 0$:

$$P\{\, C \,\triangle\, h \le \epsilon \,\} \ge 1 - \delta$$

where $C \,\triangle\, h$ is the region of difference between the class $C$ and the hypothesis $h$.
45

Probably Approximately Correct Learning


How many training examples $N$ should we have, such that with probability at least $1 - \delta$, $h$ has error at most $\epsilon$?

For the axis-aligned rectangle hypothesis class (whose error region consists of four strips around the tightest rectangle): the most general hypothesis is $G$, the most specific hypothesis is $S$; each $h \in \mathcal{H}$ between $S$ and $G$ is consistent, and together they make up the version space.

Each strip is at most $\epsilon/4$.

Pr that we miss a strip: $1 - \epsilon/4$

Pr that $N$ instances miss a strip: $(1 - \epsilon/4)^N$

Pr that $N$ instances miss 4 strips: $4(1 - \epsilon/4)^N$

Require $4(1 - \epsilon/4)^N \le \delta$; using $(1 - x) \le \exp(-x)$,

$4\exp(-\epsilon N/4) \le \delta$ and $N \ge (4/\epsilon)\log(4/\delta)$
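A quick numeric check of this bound for illustrative values of ε and δ:

```python
# PAC bound N >= (4/eps) * log(4/delta) for illustrative values eps=0.1, delta=0.05.
import math

eps, delta = 0.1, 0.05
n_required = (4.0 / eps) * math.log(4.0 / delta)
print(math.ceil(n_required))   # 176 examples suffice for this eps and delta
```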