Automatic Speech Recognition



Introduction

Readings: Jurafsky & Martin, sections 7.1-7.2; HLT Survey, chapter 1

The Human Dialogue System

[Figure-only slide.]

Computer Dialogue Systems

[Pipeline figure: signal → Audition → Automatic Speech Recognition → words → Natural Language Understanding → logical form → Dialogue Management → Planning → Natural Language Generation → words → Text-to-speech → signal]


Parameters of ASR Capabilities

Different types of tasks with different difficulties:

- Speaking mode (isolated words / continuous speech)
- Speaking style (read / spontaneous)
- Enrollment (speaker-independent / speaker-dependent)
- Vocabulary (small: < 20 words / large: > 20,000 words)
- Language model (finite-state / context-sensitive)
- Perplexity (small: < 10 / large: > 100)
- Signal-to-noise ratio (high: > 30 dB / low: < 10 dB)
- Transducer (high-quality microphone / telephone)

The Noisy Channel Model

[Figure: a message passes through a noisy channel and comes out distorted: Message + Channel = Signal.]

Decoding model: find Message* = argmax P(Message | Signal)

But how do we represent each of these things?
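Since P(Message | Signal) ∝ P(Signal | Message) · P(Message) by Bayes' rule, decoding reduces to an argmax over candidate messages. A minimal sketch of that idea; the candidate messages and all probability values below are made-up illustrative numbers, not anything from a real system:

```python
# Toy noisy-channel decoder: pick the message maximizing
# P(message | signal) ∝ P(signal | message) * P(message).
# All numbers are invented for illustration.

prior = {"go home": 0.6, "go rome": 0.1, "gnome": 0.3}   # P(message)
likelihood = {"go home": 0.5, "go rome": 0.4, "gnome": 0.2}  # P(signal | message)

def decode(prior, likelihood):
    # P(signal) is the same for every candidate, so it can be dropped
    return max(prior, key=lambda m: likelihood[m] * prior[m])

print(decode(prior, likelihood))  # -> 'go home'
```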

ASR using HMMs

- Try to solve P(Message | Signal) by breaking the problem up into separate components
- Most common method: Hidden Markov Models (HMMs)
- Assume that a message is composed of words
- Assume that words are composed of sub-word parts (phones)
- Assume that phones have some sort of acoustic realization
- Use probabilistic models for matching acoustics to phones to words

HMMs: The Traditional View

[Figure: the words "go home" decomposed into the phone sequence g, o, h, o, m, forming the Markov-model backbone (hidden because we don't know the correspondences), aligned against acoustic observations x0 … x9. Each line represents a probability estimate (more later).]

HMMs: The Traditional View

[Same figure.] Even with the same word hypothesis, we can have different alignments. Also, we have to search over all word hypotheses.

HMMs as Dynamic Bayesian Networks

[Figure: the same model drawn as a DBN. Hidden state sequence q0=g, q1=o, q2=o, q3=o, q4=h, q5=o, q6=o, q7=o, q8=m, q9=m (the Markov-model backbone composed of phones), with each state qt emitting an acoustic observation xt.]

HMMs as Dynamic Bayesian Networks

[Same figure.] ASR: what is the best assignment to q0 … q9 given x0 … x9?

Hidden Markov Models & DBNs

[Figure: the same HMM drawn two equivalent ways: the DBN representation and the Markov-model (state-diagram) representation.]

Parts of an ASR System

[Figure: Feature Calculation, Acoustic Modeling (e.g., phones k, @), Pronunciation Modeling, and Language Modeling all feed a SEARCH that outputs "The cat chased the dog".]

Pronunciation model entries: cat: k@t; dog: dog; mail: mAl; the: D&, DE

Language model (bigram) scores: cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054

Parts of an ASR System

[Same figure, annotated with each component's role:]

- Feature calculation: produces acoustics (xt)
- Acoustic modeling: maps acoustics to phones
- Pronunciation modeling: maps phones to words
- Language modeling: strings words together

Feature calculation

[Spectrogram figure: frequency vs. time.] Find the energy at each time step in each frequency channel.

[Spectrogram figure.] Then take the inverse Discrete Fourier Transform to decorrelate the frequencies.

Feature calculation

Input: [the speech signal / spectrogram]

Output: a sequence of feature vectors, e.g.

  -0.1  0.3  1.4  -1.2  2.3  2.6
   0.2  0.1  1.2  -1.2  4.4  2.2
  -6.1 -2.1  3.1   2.4  1.0  2.2
   0.2  0.0  1.2  -1.2  4.4  2.2
   …
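A minimal numpy sketch of this front end: window the signal into frames, measure the energy in each frequency channel, and take the inverse DFT of the log spectrum to decorrelate the channels. The frame length, hop size, and coefficient count are illustrative assumptions, not values from the slides; real front ends add pre-emphasis, mel filterbanks, deltas, and more.

```python
import numpy as np

def features(signal, frame_len=400, hop=160, n_coeffs=13):
    # frame_len/hop assume 16 kHz audio: 25 ms windows every 10 ms (assumption)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        # energy at this time step in each frequency channel
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        log_spec = np.log(spectrum + 1e-10)
        # inverse DFT of the log spectrum decorrelates the channels (cepstrum)
        cepstrum = np.fft.irfft(log_spec)
        frames.append(cepstrum[:n_coeffs])
    return np.array(frames)  # shape: (n_frames, n_coeffs)

x = features(np.random.randn(16000))  # one second of fake audio
print(x.shape)
```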

Robust Speech Recognition

- Different schemes have been developed for dealing with noise and reverberation
- Additive noise: reduce the effects of particular frequencies
- Convolutional noise: remove the effects of linear filters (cepstral mean subtraction; sketched below)
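Cepstral mean subtraction is simple enough to show directly: a linear channel filter becomes an additive constant in the log/cepstral domain, so subtracting each utterance's mean feature vector removes it. A sketch on synthetic features:

```python
import numpy as np

def cms(cepstra):
    # cepstra: (n_frames, n_coeffs) array of cepstral feature vectors;
    # subtract the per-utterance mean to cancel a constant channel offset
    return cepstra - cepstra.mean(axis=0, keepdims=True)

feats = np.random.randn(100, 13) + 5.0  # fake features with a channel offset
print(cms(feats).mean(axis=0))          # ~0 in every dimension
```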

Now what?

[The feature-vector sequence from the previous slide]  →  ???  →  "That you …"

Machine Learning!

[The feature-vector sequence from the previous slide]  →  pattern recognition with HMMs  →  "That you …"

Hidden Markov Models (again!)

- P(acoustics_t | state_t): the acoustic model
- P(state_t+1 | state_t): the pronunciation/language models

Acoustic Model

[Figure: each feature vector labeled with a phone: dh, a, a, t, …]

- Assume that you can label each vector with a phonetic label
- Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g., neural networks; see the sketch below):

  N_a(μ, Σ) gives P(X | state = a)
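A sketch of that recipe on synthetic data: group labeled vectors by phone, fit a Gaussian N(μ, Σ) per phone, and score a new frame with its log-likelihood log P(x | state). The labels, data, and feature dimensionality are placeholders:

```python
import numpy as np
from collections import defaultdict

def fit_gaussians(vectors, labels):
    # collect all examples of each phone, then fit mean and covariance
    groups = defaultdict(list)
    for x, ph in zip(vectors, labels):
        groups[ph].append(x)
    models = {}
    for ph, xs in groups.items():
        xs = np.array(xs)
        mu = xs.mean(axis=0)
        sigma = np.cov(xs, rowvar=False) + 1e-6 * np.eye(xs.shape[1])
        models[ph] = (mu, sigma)
    return models

def log_likelihood(x, mu, sigma):
    # log N(x; mu, Sigma), i.e. log P(x | state)
    d = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d @ np.linalg.solve(sigma, d) + logdet
                   + len(x) * np.log(2 * np.pi))

X = np.random.randn(200, 13)                     # fake feature vectors
y = np.random.choice(["dh", "a", "t"], size=200) # fake phone labels
models = fit_gaussians(X, y)
print(log_likelihood(X[0], *models["dh"]))       # log P(x | state='dh')
```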

Building up the Markov Model

- Start with a model for each phone
- Typically, we use 3 states per phone to give a minimum duration constraint, but ignore that here… (a concrete matrix follows below)

[Figure: a phone model for phone a: a state with self-transition probability p and exit probability 1−p (the transition probabilities).]
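As a concrete transition structure, here is a 3-state version of that phone model written as a matrix; p = 0.7 is an arbitrary illustrative value, and the last column stands in for the exit transition so each row sums to 1:

```python
import numpy as np

# Each state loops on itself with probability p and moves on with 1-p,
# so the phone must occupy at least 3 frames (the minimum-duration
# constraint mentioned above).
p = 0.7
A = np.array([
    [p,   1 - p, 0.0,   0.0  ],  # state 1 -> itself or state 2
    [0.0, p,     1 - p, 0.0  ],  # state 2 -> itself or state 3
    [0.0, 0.0,   p,     1 - p],  # state 3 -> itself or exit
])
print(A)
```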

Building up the Markov Model

- The pronunciation model gives connections between phones and words

[Figure: the word "that" as a chain of phone models dh → a → t, each with its own self-loop probability p_phone and exit probability 1−p_phone.]

- Multiple pronunciations:

[Figure: a pronunciation network in which alternative phone paths branch and rejoin; the phones shown include t, m, ow, ah, ey.]

Building up the Markov Model

- The language model gives connections between words (e.g., a bigram grammar, estimated as sketched below)

[Figure: the word "that" (dh, a, t) connects to "he" (h, iy) with probability p(he | that) and to "you" (y, uw) with probability p(you | that).]
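A bigram grammar like p(he | that) can be estimated by simple counting: count(that he) / count(that). A toy example with an invented corpus:

```python
from collections import Counter

corpus = "that he said that you said that he left".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w2, w1):
    # maximum-likelihood bigram estimate: count(w1 w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(p("he", "that"))   # p(he | that)  = 2/3
print(p("you", "that"))  # p(you | that) = 1/3
```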

ASR as Bayesian Inference

[Figure: a fragment of the search space: states q1w1, q2w1, q3w1 emitting x1, x2, x3; the phone chain for "that" (th, a, t) branches via p(he | that) to "he" (h, iy) and via p(you | that) to "you" (y, uw), with a further continuation (sh, uh, d).]

argmax_W P(W | X)
  = argmax_W P(X | W) P(W) / P(X)
  = argmax_W P(X | W) P(W)
  = argmax_W Σ_Q P(X, Q | W) P(W)
  ≈ argmax_W max_Q P(X, Q | W) P(W)
  ≈ argmax_W max_Q P(X | Q) P(Q | W) P(W)

ASR Probability Models

Three probability models:

- P(X | Q): acoustic model
- P(Q | W): duration/transition/pronunciation model
- P(W): language model

The language and pronunciation models are inferred from prior knowledge; the other models are learned from data (how?)

Parts of an ASR System

[Same pipeline figure as before, now annotated with the probability models: Acoustic Modeling supplies P(X | Q), Pronunciation Modeling supplies P(Q | W), and Language Modeling supplies P(W).]

EM for ASR: The Forward-Backward Algorithm

- Determine "state occupancy" probabilities, i.e., assign each data vector to a state (see the sketch after this list)
- Calculate new transition probabilities and new means & standard deviations (the emission probabilities) using those assignments
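A minimal numpy sketch of the forward-backward pass for a discrete-observation HMM, computing the state-occupancy probabilities gamma[t, i] = P(q_t = i | X) that the E-step uses to assign vectors to states. The matrices A, B, pi and the observation sequence are toy values:

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                      # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                       # P(q_t = i, X), unnormalized
    return gamma / gamma.sum(axis=1, keepdims=True)

A = np.array([[0.7, 0.3], [0.2, 0.8]])         # transition probabilities
B = np.array([[0.9, 0.1], [0.3, 0.7]])         # emission probabilities
pi = np.array([0.6, 0.4])                      # initial distribution
print(forward_backward(A, B, pi, [0, 1, 1, 0]))
```

The M-step then re-estimates transitions and Gaussian means/variances from these soft assignments, and the two steps alternate until convergence.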


Search

- When trying to find W* = argmax_W P(W | X), we need to look at (in theory):
  - All possible word sequences W
  - All possible segmentations/alignments of W and X
- Generally, this is done by searching the space of W (see the sketch below):
  - Viterbi search: a dynamic-programming approach that looks for the most likely path
  - A* search: an alternative method that keeps a stack of hypotheses around
- If |W| is large, pruning becomes important
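A sketch of Viterbi search on the same toy HMM as the forward-backward example: dynamic programming over states in log space (to avoid underflow), with backpointers to recover the single most likely path:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    T, N = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]       # best log-score ending in each state
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA         # scores[i, j]: best path i -> j
        back[t] = scores.argmax(axis=0)        # best predecessor of each state
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]               # trace the best path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [0, 1, 1, 0]))
```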

How to train an ASR system

- Have a speech corpus at hand
  - It should have word (and preferably phone) transcriptions
  - Divide it into training, development, and test sets
- Develop models of prior knowledge
  - Pronunciation dictionary
  - Grammar
- Train acoustic models
  - Possibly realigning the corpus phonetically

How to train an ASR system

- Test on your development data (baseline)
- Then iterate:
  - Think real hard
  - Figure out some neat new modification
  - Retrain the system component
  - Test on your development data
  - Lather, rinse, repeat
- Then, at the end of the project, test on the test data.

Judging the quality of a system

Usually, ASR performance is judged by the word error rate:

  ErrorRate = 100 × (Subs + Ins + Dels) / Nwords

  REF:  I  WANT  TO   GO  HOME  ***
  REC:  *  WANT  TWO  GO  HOME  NOW
  SC:   D  C     S    C   C     I

  100 × (1 S + 1 I + 1 D) / 5 = 60%
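Word error rate is a minimum edit distance over words, normalized by the reference length. A sketch that reproduces the example above:

```python
def wer(ref, rec):
    r, h = ref.split(), rec.split()
    # d[i][j]: min edits to turn the first i ref words into the first j rec words
    d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])   # substitution/match
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # del, ins
    return 100.0 * d[-1][-1] / len(r)

print(wer("I WANT TO GO HOME", "WANT TWO GO HOME NOW"))  # 60.0
```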

Judging the quality of a system

- Usually, ASR performance is judged by the word error rate
  - This assumes that all errors are equal
  - There is also a bit of a mismatch between the optimization criterion and the error measurement
- Other (task-specific) measures are sometimes used
  - Task completion
  - Concept error rate