Presented by: Fang-Hui Chu
A Survey of Boosting HMM Acoustic Model Training
2
Introduction
•
The
No Free Lunch Theorem
states that
–
There is no single learning algorithm that in any domain always
induces the most accurate learner
•
Learning is an ill-posed problem, and with finite data each
algorithm converges to a different solution and fails under
different circumstances
–
Though the performance of a learner may be fine-tuned, there are still
instances on which even the best learner is not accurate enough
•
The idea is..
–
There may be another learner that is accurate on these instances
–
By suitably combining multiple learners then, accuracy can be improved
3
Introduction
•
Since there is no point in combining learners that always
make similar decisions
–
The aim is to find a set of base-learners that differ in their
decisions so that they complement each other
•
There are different ways the multiple base-learners are
combined to generate the final outputs:
–
Multiexpert
combination
methods
•
Voting
and its variants
•
Mixture of experts
•
Stacked generalization
–
Multistage
combination
methods
•
Cascading
4
Voting
•
The simplest way to combine multiple classifiers
–
which corresponds to taking a linear combination of the learners
–
this is also known as
ensembles
and
linear opinion pools
–
The name voting comes from its use in classification
•
if w_j = 1/L for all j, called plurality voting
•
if K = 2 and w_j = 1/L, called majority voting
•
The combined output for the K classes is

  y_i = \sum_{j=1}^{L} w_j d_{ji},   i = 1, ..., K,   with w_j \ge 0 and \sum_{j=1}^{L} w_j = 1
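To make the combination rule concrete, here is a minimal Python sketch of weighted voting (illustrative only, not from the original slides); d is an L x K matrix of base-learner outputs and w holds the learner weights:

import numpy as np

def weighted_vote(d, w):
    """d: (L, K) array, d[j, i] = output of learner j for class i.
    w: (L,) array of non-negative weights summing to 1.
    Returns the combined outputs y_i = sum_j w_j * d_ji and the winning class."""
    d, w = np.asarray(d, dtype=float), np.asarray(w, dtype=float)
    y = w @ d
    return y, int(np.argmax(y))

L, K = 3, 4
d = np.random.rand(L, K)                             # toy base-learner outputs
y, winner = weighted_vote(d, np.full(L, 1.0 / L))    # plurality voting: w_j = 1/L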
5
Bagging
•
Bagging is a voting method whereby base-learners are made
different by training them over slightly different training sets
–
this is done by the bootstrap (see the sketch at the end of this slide)
•
where given a training set
X
of size
N
, we draw
N
instances randomly from
X
with replacement
•
In bagging, generating complementary base-learners is left to
chance and to the instability of the learning method
–
A learning algorithm is an
unstable algorithm
if small changes in the
training set cause a large difference in the generated learner
•
decision trees, multilayer perceptrons, condensed nearest neighbor
•
Bagging is short for
Bootstrap aggregating
Breiman, L. 1996. "Bagging Predictors." Machine Learning 26, 123–140
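A minimal sketch (not from the slides) of the bootstrap step used by bagging, assuming the training set fits in a Python list:

import numpy as np

def bootstrap_sample(X, rng):
    """Draw N instances from X (size N) uniformly at random with replacement."""
    idx = rng.integers(0, len(X), size=len(X))   # duplicate indices are allowed
    return [X[i] for i in idx]

rng = np.random.default_rng(0)
X = list(range(10))                              # toy training set of size N = 10
replicates = [bootstrap_sample(X, rng) for _ in range(5)]   # 5 bagging rounds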
6
Boosting
•
In boosting, we actively try to generate complementary base-learners
by training the next learner on the mistakes of the
previous learners
•
The original boosting algorithm
(Schapire 1990)
combines three
weak learners
to generate a
strong learner
–
In the sense of the
probably approximately correct
(PAC) learning
model
•
Disadvantage
–
It requires a very large training sample
Schapire, R.E. 1990. "The Strength of Weak Learnability." Machine Learning 5, 197–227
[Figure: three training sets X1, X2, X3 and the corresponding weak learners d1, d2, d3]
7
AdaBoost
•
AdaBoost, short for adaptive boosting, uses the same training set
over and over, so the training set need not be large, and it can
combine an arbitrary number of base-learners, not just three
•
The idea is to modify the probabilities of drawing the instances
as a function of the error
–
The probability of a correctly classified instance is decreased, then a
new sample set is drawn from the original sample according to these
modified probabilities
–
This focuses training on instances misclassified by the previous learners
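As an illustration of this idea (a hedged sketch, not the exact update rule of any specific AdaBoost variant), the sampling probabilities can be scaled down for correctly classified instances and renormalized before the next learner is trained:

import numpy as np

def update_probs(p, correct, beta):
    """p: current sampling probabilities (sum to 1).
    correct: boolean array, True where the previous learner was right.
    beta: factor in (0, 1); correctly classified instances lose probability mass."""
    p = np.asarray(p, dtype=float)
    p = np.where(correct, p * beta, p)
    return p / p.sum()

def resample(X, p, rng):
    """Draw a new training set of the same size according to the modified probabilities."""
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return [X[i] for i in idx]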
•
Schapire
et al.
explain that the success of AdaBoost is due to
its property of increasing the
margin
–
Schapire et al. 1998. "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods." Annals of Statistics 26, 1651–1686
Freund and Schapire. 1996. "Experiments with a New Boosting Algorithm." In ICML 13, 148–156
8
AdaBoost.M2
(Freund and Schapire, 1997)
[Algorithm figure: AdaBoost.M2 pseudo-code; the procedure stops if the weighted pseudo-loss ε_j ≥ 1/2]
Freund and Schapire. 1997. "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting." Journal of Computer and System Sciences 55, 119–139
9
Evolution of Boosting Algorithms
[Timeline figure, reconstructed as a chronological list; the original chart grouped the earliest work under neural-network (connectionist) systems and the later work under GMM-HMM systems]
• 1996, ICSLP 96: G. Cook & T. Robinson, "Boosting the Performance of Connectionist LVSR"
• 1997, EuroSpeech 97: G. Cook et al., "Ensemble Methods for Connectionist Acoustic Modeling"
• 1999, ICASSP 99: H. Schwenk, "Using Boosting to Improve a Hybrid HMM/Neural Network Speech Recognizer"
• 2000, ICASSP 00: G. Zweig & M. Padmanabhan, "Boosting Gaussian Mixtures in An LVCSR System"
• 2002, ICASSP 02: I. Zitouni et al., "Combination of Boosting and Discriminative Training for Natural Language Call Steering Systems"
• 2002, ICASSP 02: C. Meyer, "Utterance-Level Boosting of HMM Speech Recognition"
• 2003, ICASSP 03: R. Zhang & A. Rudnicky, "Improving the Performance of An LVCSR System Through Ensembles of Acoustic Models"
• 2003, EuroSpeech 03: R. Zhang & A. Rudnicky, "Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models"
• 2004, ICASSP 04: C. Dimitrakakis & S. Bengio, "Boosting HMMs with An Application to Speech Recognition"
• 2004, ICSLP 04: R. Zhang & A. Rudnicky, "A Frame Level Boosting Training Scheme for Acoustic Modeling"
• 2004, ICSLP 04: R. Zhang & A. Rudnicky, "Optimizing Boosting with Discriminative Criteria"
• 2004, ICSLP 04: R. Zhang & A. Rudnicky, "Apply N-Best List Re-Ranking to Acoustic Model Combinations of Boosting Training"
• 2005, EuroSpeech 05: R. Zhang et al., "Investigations on Ensemble Based Semi-Supervised Acoustic Model Training"
• 2006, SpeechCom 06: C. Meyer & H. Schramm, "Boosting HMM Acoustic Models in LVCSR"
• 2006, ICSLP 06: R. Zhang & A. Rudnicky, "Investigations of Issues for Using Multiple Acoustic Models to Improve CSR"
Presented by: Fang-Hui Chu
Improving The Performance of An LVCSR
System Through Ensembles of Acoustic Models
ICASSP 2003
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute,
School of Computer Science
Carnegie Mellon University
11
Bagging vs. Boosting
•
Bagging
–
In each round, bagging randomly selects a number of examples from
the original training set, and produces a new single classifier based on
the selected subset
–
The final classifier is built by choosing the hypothesis best agreed on by
single classifiers
•
Boosting
–
In boosting, the single classifiers are iteratively trained in a fashion such
that hard-to-classify examples are given increasing emphasis
–
A parameter that measures the classifier's importance is determined
with respect to its classification accuracy
–
The final hypothesis is the weighted majority vote from the single
classifiers
12
Algorithms
•
The first algorithm is based on the intuition that an incorrectly
recognized utterance should receive more attention in training
•
If the weight of an utterance is 2.6, we first add two copies of the utterance
to the new training set, and then add its third copy with probability 0.6
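A small sketch of this resampling rule (illustrative; the utterance identifiers and weights below are made up):

import math
import random

def duplicate_by_weight(utterance, weight):
    """Add floor(weight) copies, plus one extra copy with probability equal to the
    fractional part, e.g. weight 2.6 -> two copies always, a third with probability 0.6."""
    copies = [utterance] * int(math.floor(weight))
    if random.random() < weight - math.floor(weight):
        copies.append(utterance)
    return copies

new_training_set = []
for utt, w in [("utt_001", 2.6), ("utt_002", 0.4)]:   # hypothetical utterance weights
    new_training_set.extend(duplicate_by_weight(utt, w))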
13
Algorithms
•
The exponential increase
in the size of the training set
is a severe problem for
algorithm 1
•
Algorithm 2 is proposed
to address this problem
14
Algorithms
•
In Algorithms 1 and 2, there is
no attempt to measure how
important a model is relative
to the others
–
A good model should play a more
important role than a bad one
•
To address this, each model is assigned an importance weight c_t, obtained by minimizing an exponential loss over the training utterances:

  L = \sum_{i=1}^{N} \exp\Big( \sum_{t=1}^{T} c_t e_t(x_i) \Big),
  where e_t(x_i) = 1 if utterance x_i is misrecognized by model t, and e_t(x_i) = 0 otherwise

•
The utterance weights for the next round are proportional to this accumulated loss:

  w_{T+1}(x_i) = \exp\Big( \sum_{t=1}^{T} c_t e_t(x_i) \Big) \Big/ \sum_{i'=1}^{N} \exp\Big( \sum_{t=1}^{T} c_t e_t(x_{i'}) \Big)
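A rough sketch of how this loss could be evaluated; the 0/1 error matrix, the model weights c_t, and the normalization follow the reconstruction given above and are assumptions, not code from the paper:

import numpy as np

def exp_loss_and_weights(e, c):
    """e: (T, N) array, e[t, i] = 1 if model t misrecognizes utterance i, else 0.
    c: (T,) importance weights of the T models trained so far.
    Returns the exponential loss L and normalized utterance weights for the next round."""
    accumulated = np.asarray(c, dtype=float) @ np.asarray(e, dtype=float)  # per-utterance weighted error
    losses = np.exp(accumulated)
    return losses.sum(), losses / losses.sum()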
15
Experiments
•
Corpus : CMU Communicator system
•
Experimental results :
Presented by: Fang-Hui Chu
Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles
of Acoustic Models
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute, CMU
EuroSpeech 2003
17
Non-Boosting method
•
Bagging
–
is a commonly used method in the machine learning field
–
randomly selects a number of examples from the original training set
and produces a new single classifier
–
in this paper, we call it a non-Boosting method
•
Based on the intuition
–
Misrecognized utterances should receive more attention in the
successive training rounds
18
Algorithms
λ is a parameter that prevents the size of
the training set from being too large.
19
Experiments
•
The corpus:
–
Training set: 31248 utterances; Test set: 1689 utterances
Presented by: Fang-Hui Chu
A Frame Level Boosting Training Scheme for
Acoustic Modeling
ICSLP 2004
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute,
School of Computer Science
Carnegie Mellon University
21
Introduction
•
In the current Boosting algorithm,
utterance
is the basic unit
used for acoustic model training
•
Our analysis shows that there are two notable weaknesses in
this setting:
–
First, the objective function of the current Boosting algorithm is designed to
minimize utterance error instead of word error
–
Second, in the current algorithm, an utterance is treated as a single unit for resampling
•
This paper proposes a frame level Boosting training scheme
for acoustic modeling to address these two problems
22
Frame Level Boosting Training Scheme
•
The metric used in Boosting training is the frame-level conditional probability of the (word-level) label a_t at frame t, computed over the N-best list:

  P(a_t | x) = \sum_{h \in NBest,\, label_t(h) = a_t} P(h | x) \Big/ \sum_{h \in NBest} P(h | x)

•
Objective function:

  L = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \sum_{a \ne a_{i,t}} \exp\big( P(a | x_i) - P(a_{i,t} | x_i) \big)

•
The inner sum  \epsilon_{i,t} = \sum_{a \ne a_{i,t}} \exp\big( P(a | x_i) - P(a_{i,t} | x_i) \big)  is the pseudo loss for frame t, which describes the degree of confusion of this frame for recognition
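A minimal sketch of the frame-level posterior above, assuming each N-best hypothesis comes with its sentence posterior P(h | x) and a frame-level word labelling (both representations are placeholders):

from collections import defaultdict

def frame_posterior(nbest, t):
    """nbest: list of (posterior, frame_labels) pairs, where frame_labels[t] is the
    word label that hypothesis h assigns to frame t and posterior is P(h | x).
    Returns a dict mapping each label a to P(a | x) at frame t."""
    total = sum(p for p, _ in nbest)
    mass = defaultdict(float)
    for p, labels in nbest:
        mass[labels[t]] += p       # numerator: sum of P(h|x) over hypotheses with label a at frame t
    return {a: m / total for a, m in mass.items()}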
23
Frame Level Boosting Training Scheme
•
Training Scheme:
–
How to resample the frame-level training data?
•
duplicate each frame x_{i,t} a number of times determined by its weight, and
concatenate the duplicated frames into a new utterance
for acoustic model training
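A hedged sketch of this frame-level resampling (rounding the weights to integer repetition counts is an assumption):

import numpy as np

def resample_frames(frames, weights):
    """frames: (T, D) array of feature vectors of one utterance.
    weights: length-T frame weights from Boosting (larger = more confusable frame).
    Duplicates each frame according to its (rounded) weight and concatenates the
    copies into a new utterance for acoustic model training."""
    reps = np.maximum(1, np.rint(np.asarray(weights)).astype(int))
    return np.repeat(np.asarray(frames), reps, axis=0)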
24
Experiments
•
Corpus : CMU Communicator system
•
Experimental results :
Presented by: Fang-Hui Chu
Boosting HMM acoustic models in large
vocabulary speech recognition
Carsten Meyer, Hauke Schramm
Philips Research Laboratories, Germany
SPEECH COMMUNICATION 2006
26
Utterance approach for boosting in ASR
•
An intuitive way of applying boosting to HMM speech
recognition is at the utterance level
–
Thus, boosting is used to improve upon an initial ranking of candidate
word sequences
•
The utterance approach has two advantages:
–
First, it is directly related to the sentence error rate
–
Second, it is computationally much less expensive than boosting applied
at the level of feature vectors
27
Utterance approach for boosting in ASR
•
In the utterance approach, we define the input pattern x_i to be
the sequence of feature vectors corresponding to the entire
utterance i
•
y denotes one possible candidate word sequence of the
speech recognizer, y_i being the correct word sequence for
utterance i
•
The a posteriori confidence measure h_t(x_i, y) is calculated on the basis of
the N-best list L_i for utterance i:

  h_t(x_i, y) = p(y) \, p_t(x_i | y) \Big/ \sum_{z \in L_i} p(z) \, p_t(x_i | z)
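A minimal sketch of this confidence measure computed from an N-best list; in practice the recognizer scores are in the log domain with scaling exponents, which is omitted here:

def utterance_confidence(nbest, y):
    """nbest: dict mapping a candidate word sequence z to (p(z), p(x | z)),
    i.e. its language model prior and acoustic likelihood.
    y: the candidate word sequence of interest (must appear in the N-best list).
    Returns h(x, y) = p(y) p(x|y) / sum_z p(z) p(x|z)."""
    numerator = nbest[y][0] * nbest[y][1]
    denominator = sum(pz * pxz for pz, pxz in nbest.values())
    return numerator / denominator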
28
Utterance approach for boosting in ASR
•
Based on the confidence values and the AdaBoost.M2 algorithm,
we calculate an utterance weight w_t(i) for each training
utterance i
•
Subsequently, the weights are used in maximum likelihood and
discriminative (MMI) training of the Gaussian mixture models:

  F_{ML} = \sum_{i=1}^{N} w_t(i) \log p_t(x_i | y_i)

  F_{MMI} = \sum_{i=1}^{N} w_t(i) \log \frac{ p_t(x_i | y_i) \, p(y_i) }{ \sum_{y} p_t(x_i | y) \, p(y) }
29
Utterance approach for boosting in ASR
•
Some problems are encountered when applying it to large-scale
continuous speech applications:
–
The N-best lists of reasonable length (e.g. N = 100) generally contain
only a tiny fraction of the possible classification results
•
This has two consequences:
–
In training, it may lead to sub-optimal utterance weights
–
In recognition, Eq. (1), which combines the per-model confidences, cannot be applied appropriately:

  h_t(x, y) = p(y) \, p_t(x | y) \Big/ \sum_{z \in L} p(z) \, p_t(x | z)

  H(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \ln\big( 1/\beta_t \big) \, h_t(x, y)     (1)
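A sketch of the combination rule in Eq. (1), assuming the per-model confidences have already been computed for every candidate in a common hypothesis list:

import math

def combined_hypothesis(confidences, betas):
    """confidences: list over boosting iterations of dicts {candidate y: h_t(x, y)}.
    betas: list of beta_t values from AdaBoost (0 < beta_t < 1).
    Returns the candidate maximizing sum_t ln(1/beta_t) * h_t(x, y)."""
    scores = {}
    for conf_t, beta_t in zip(confidences, betas):
        weight = math.log(1.0 / beta_t)
        for y, h in conf_t.items():
            scores[y] = scores.get(y, 0.0) + weight * h
    return max(scores, key=scores.get)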
30
Utterance approach for CSR – Training
•
Training
–
A convenient strategy to reduce the complexity of the classification task
and to provide more meaningful
N-best lists consists in "chopping" the
training data
–
For long sentences, it simply means to insert additional sentence break
symbols at silence intervals with a given minimum length
–
This reduces the number of possible classifications of each sentence
"fragment", so that the resulting N-best lists should cover a sufficiently
large fraction of hypotheses
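A toy sketch of the chopping step (the silence-duration representation and the 0.3 s threshold are illustrative assumptions):

def chop_utterance(words, silence_after, min_sil=0.3):
    """words: word sequence of a long training sentence.
    silence_after: silence duration in seconds following each word (same length).
    Inserts a sentence break wherever the silence exceeds min_sil and returns
    the shorter fragments, whose N-best lists are more manageable."""
    fragments, current = [], []
    for word, sil in zip(words, silence_after):
        current.append(word)
        if sil >= min_sil:
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments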
31
Utterance approach for CSR – Decoding
•
Decoding: lexical approach for model combination
–
A single pass decoding setup, where the combination of the boosted
acoustic models is realized at a
lexical level
–
The basic idea is to add a new pronunciation model by “replicating” the
set of phoneme symbols in each boosting iteration (e.g. by appending
the suffix "_t" to the phoneme symbol)
–
The new phoneme symbols ("au", "au_1", "au_2", …) represent the underlying
acoustic model of boosting iteration t
32
Utterance approach for CSR – Decoding
•
Decoding: lexical approach for model combination (cont.)
–
Add to each phonetic transcription in the decoding lexicon a new
transcription using the corresponding phoneme set (e.g. "sic_a", "sic_1 a_1", …)
–
Use the reweighted training data to train the boosted classifier M_t
–
Decoding is then performed using the extended lexicon and the set of
acoustic models weighted by their unigram prior probabilities, which
are estimated on the training data (weighted summation)
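A sketch of the lexical extension for one boosting iteration (the dictionary layout is an assumption):

def extend_lexicon(lexicon, t):
    """lexicon: dict mapping a word to a list of pronunciations (lists of phonemes).
    Returns a copy in which every pronunciation also gets a variant whose phonemes
    carry the suffix '_t', pointing to the acoustic model of boosting iteration t."""
    extended = {}
    for word, prons in lexicon.items():
        variants = list(prons)
        variants += [[f"{ph}_{t}" for ph in pron] for pron in prons]
        extended[word] = variants
    return extended

# e.g. extend_lexicon({"word": [["au", "b"]]}, 1) adds the variant ["au_1", "b_1"]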
33
In more detail
[Diagram: in boosting iteration t, the phonetically transcribed training corpus and the utterance weights w_t(i) are used for ML/MMI training of the new acoustic model M_t; the phoneme set is replicated with the suffix "_t", the lexicon is extended with the corresponding pronunciation variants (e.g. "sic_a", "sic_1 a_1", …), and the resulting model set M_1, M_2, …, M_t is combined for decoding, either unweighted or weighted]
•
Decoding with the extended lexicon sums over the pronunciation variants contributed by the individual models:

  (w_1^N)^* = \arg\max_{w_1^N} p(w_1^N | x_1^N) = \arg\max_{w_1^N} p(w_1^N) \, p(x_1^N | w_1^N)

  p(x_1^N | w_1^N) = \sum_{v_1^N \in R(w_1^N)} p(v_1^N | w_1^N) \, p(x_1^N | v_1^N)

  where R(w_1^N) denotes the pronunciation variants of the word sequence w_1^N in the extended lexicon
34
In more detail
35
Weighted model combination
•
Word-level model combination: the acoustic likelihood of each word is a weighted sum over the boosted models,

  p(x_i | w_i) = \sum_{t=1}^{T} p(t) \sum_{v_i \in R_t(w_i)} p(v_i | w_i, t) \, p(x_i | v_i, t)

  where R_t(w_i) is the set of pronunciation variants of w_i belonging to model t and, for simplicity, the model prior is chosen as p(t) \propto \ln(1/\beta_t)
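A hypothetical sketch of this word-level combination; the uniform pronunciation prior and the acoustic_score callback are assumptions for illustration only:

def combined_word_likelihood(word, model_priors, variants, acoustic_score):
    """model_priors: dict t -> p(t), e.g. proportional to ln(1/beta_t) and normalized.
    variants: dict t -> list of pronunciation variants of `word` in model t's phoneme set.
    acoustic_score: function (variant, t) -> p(x | variant) under acoustic model t.
    Returns sum_t p(t) * sum_v p(v | word, t) * p(x | v, t) with a uniform p(v | word, t)."""
    total = 0.0
    for t, p_t in model_priors.items():
        prons = variants[t]
        p_v = 1.0 / len(prons)                 # uniform pronunciation prior (assumption)
        total += p_t * sum(p_v * acoustic_score(v, t) for v in prons)
    return total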
36
Experiments
•
Isolated word recognition
–
Telephone-bandwidth large vocabulary isolated word recognition
–
SpeechDat(II) German material
•
Continuous speech recognition
–
Professional dictation and Switchboard
37
Isolated word recognition
•
Database:
–
Training corpus: consists of 18k utterances (4.3h) of city, company, first
and family names
–
Evaluations:
•
LILI
test corpus: 10k single word utterances (3.5h); 10k words lexicon;
(matched conditions)
•
Names
corpus: an in-house collection of 676 utterances (0.5h); two different
decoding lexica: 10k lex, 190k lex; (acoustic conditions are matched,
whereas there is a lexical mismatch)
•
Office
corpus: 3.2k utterances (1.5h), recorded over microphone in clean
conditions; 20k lexicon; (an acoustic mismatch to the training conditions)
38
Isolated word recognition
•
Boosting ML models
39
Isolated word recognition
•
Combining boosting and discriminative training
–
The experiments in isolated word recognition showed that boosting may
improve the best test error rates
40
Continuous speech recognition
•
Database
–
Professional dictation
•
An in-house data collection of real-life recordings of medical reports
•
The acoustic training corpus consists of about 58h of data
•
Evaluations were carried out on two test corpora:
–
Development corpus consists of 5.0h of speech
–
Evaluation corpus consists of 3.3h of speech
–
Switchboard
•
Consisting of spontaneous conversations recorded over telephone lines;
57h (73h) of male (female) training data
•
Evaluation corpus:
–
Containing about 1h (0.5h) of male (female) speech
41
Continuous speech recognition
•
Professional dictation:
42
•
Switchboard:
43
Conclusions
•
In this paper, a boosting approach which can be applied to
any HMM-based speech recognizer was presented and
evaluated
•
The increased recognizer complexity and thus decoding effort
of the boosted systems is a major drawback compared to
other training techniques like discriminative training
44
Probably Approximately Correct Learning
•
We would like our hypothesis to be approximately correct,
namely, that
the error probability be bounded by some value
•
We also would like to be confident in our hypothesis in that we
want to know that our hypothesis will be correct most of the
time, so we want to be probably correct as well
•
Given a class C and examples drawn from some unknown
but fixed probability distribution p(x), we want a hypothesis h that, with probability
at least 1 − δ, has error at most ε, for
arbitrary ε > 0 and δ ≤ 1/2:

  P\{ C \Delta h \le \epsilon \} \ge 1 - \delta

where C Δ h denotes the region on which the class C and the hypothesis h disagree
45
Probably Approximately Correct Learning
•
How many training examples
N
should we have, such that with
probability at least
1 ‒ δ,
h
has
error at most
ε ?
[Figure: the most specific hypothesis S and the most general hypothesis G; each h ∈ H between S and G is consistent, and together they make up the version space. In the axis-aligned rectangle example, the error region between h and the target concept C is covered by four "strips" around the rectangle.]
•
Each strip has probability at most ε/4
•
Pr that we miss a strip: 1 − ε/4
•
Pr that N instances miss a strip: (1 − ε/4)^N
•
Pr that N instances miss any of the 4 strips: at most 4(1 − ε/4)^N
•
Require 4(1 − ε/4)^N ≤ δ and use (1 − x) ≤ exp(−x)
•
4 exp(−εN/4) ≤ δ and N ≥ (4/ε) ln(4/δ)
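As a quick worked example of the bound (using natural logs, as in the derivation above):

import math

def pac_sample_size(eps, delta):
    """Smallest integer N with N >= (4/eps) * ln(4/delta) for the rectangle example:
    with probability at least 1 - delta the hypothesis has error at most eps."""
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

print(pac_sample_size(0.05, 0.05))   # -> 351 examples suffice for eps = delta = 0.05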