Lecture 3 Speech Recognition - Parham Aarabi


Nov 17, 2013


© 2013 P. Aarabi

ECE1780 Lecture 3





Prof. Parham Aarabi

Speech Recognition


SPEECH RECOGNITION

How can computers understand human speech?

- Over 60 years of research
- Claimed to be solved almost every year
- Basic speech recognition is doable (digits, limited vocabulary, etc.), but full/complete speech recognition has proven very difficult

[Figure: time-frequency transform]

[Figure: time-frequency plot of the word "hello", with the segments "hel" and "lo" labeled; axes: frequency (ω) and time segment index (k)]

How do we model speech patterns?


Cepstrum








$$x(t) = \int X(\omega)\, e^{j\omega t}\, d\omega$$

$$X(\omega) = \int x(t)\, e^{-j\omega t}\, dt$$

$$c(t) = \int \log|X(\omega)|\, e^{j\omega t}\, d\omega$$
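A discrete-time counterpart of the cepstrum definition can be sketched with NumPy's FFT. This is a minimal sketch, not the lecture's implementation: the function name `real_cepstrum`, the small `eps` guard against log(0), the use of the spectrum magnitude, and the test frame are all assumptions made for illustration.

```python
import numpy as np

def real_cepstrum(frame, eps=1e-12):
    """Real cepstrum of one frame: inverse DFT of the log-magnitude spectrum."""
    spectrum = np.fft.fft(frame)               # X(w): forward transform of x(t)
    log_mag = np.log(np.abs(spectrum) + eps)   # log|X(w)|; eps (assumed) avoids log(0)
    return np.fft.ifft(log_mag).real           # c(t): inverse transform of log|X(w)|

# Hypothetical test frame: a short decaying exponential.
frame = 0.8 ** np.arange(64)
cepstrum = real_cepstrum(frame)
```

NumPy's inverse transform includes a 1/N factor, which only rescales the result. For a real-valued frame the log-magnitude spectrum is even, so the resulting cepstrum is real and symmetric.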


Why is the Cepstrum useful?

$$X(\omega) = V(\omega)\, F(\omega)$$

V(ω): Voiced portion (many harmonics)
F(ω): Formant portion (no harmonics)

Why is the Cepstrum useful?

$$c(t) = \int \log|V(\omega) F(\omega)|\, e^{j\omega t}\, d\omega = \int \log|V(\omega)|\, e^{j\omega t}\, d\omega + \int \log|F(\omega)|\, e^{j\omega t}\, d\omega$$

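Why is the Cepstrum useful?

$$c(t) = \int \log|V(\omega) F(\omega)|\, e^{j\omega t}\, d\omega = \int \log|V(\omega)|\, e^{j\omega t}\, d\omega + \int \log|F(\omega)|\, e^{j\omega t}\, d\omega = c_V(t) + c_F(t)$$

c_V(t): Later t samples (info about pitch)
c_F(t): Earlier t samples (info about formants)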
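The split of the cepstrum into a pitch part and a formant part can be checked numerically. The following is a toy sketch under stated assumptions: `v` is a hypothetical excitation with a repeat at a lag of 64 samples (standing in for the pitch period), `f` is a hypothetical smooth decaying response (standing in for the formant envelope), and the two are combined by circular convolution so that X = V F holds exactly for the DFT.

```python
import numpy as np

def real_cepstrum(signal):
    """Inverse DFT of the log-magnitude spectrum (assumes no spectral zeros)."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(signal)))).real

n = 256
lag = 64                                   # hypothetical pitch period, in samples

v = np.zeros(n)                            # toy voiced excitation: pulse plus a repeat
v[0], v[lag] = 1.0, 0.5
f = 0.8 ** np.arange(n)                    # toy formant-like response (smooth spectrum)

# Circular convolution of v and f, so that X(w) = V(w) F(w) for the DFT.
x = np.fft.ifft(np.fft.fft(v) * np.fft.fft(f)).real

c_x, c_v, c_f = real_cepstrum(x), real_cepstrum(v), real_cepstrum(f)

# The product in frequency becomes a sum in the cepstrum: c = c_V + c_F.
additive = np.allclose(c_x, c_v + c_f)

# Search the later samples (here 20 to 99, an arbitrary range) for the pitch peak.
pitch_peak = 20 + int(np.argmax(c_x[20:100]))
```

In this toy setup the formant-like part is concentrated in the first few cepstral samples, while the pitch-like part shows up as an isolated peak at the lag, which is why truncating the cepstrum keeps mostly formant information.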


Why is the Cepstrum useful?

- Compute the Cepstrum, and take the first few samples that represent the formant structure
- Learn a probabilistic model for the Cepstral Coefficients
- Usually, the Cepstrum is computed on a perceptual scale (i.e. Mel Scale)
- The features extracted per sound window are then the Mel-Frequency Cepstral Coefficients (MFCC)
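The steps in the list above can be sketched for a single sound window. This is one common way to compute MFCCs, not necessarily the one used in the lecture: the Hamming window, the triangular mel filterbank, the 26 filters, the 13 coefficients, and the DCT-II (used here in place of an inverse Fourier transform of the log spectrum) are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, bin_freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(frame, sample_rate, n_filters=26, n_coeffs=13):
    """MFCC feature vector for one window of audio samples."""
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    energies = mel_filterbank(n_filters, frame.size, sample_rate) @ power
    log_energies = np.log(energies + 1e-10)        # small constant avoids log(0)
    # DCT-II of the log mel energies; keep only the first n_coeffs coefficients.
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return basis @ log_energies

# Hypothetical 25 ms window at 16 kHz containing two tones.
sample_rate = 16000
t = np.arange(400) / sample_rate
frame = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
features = mfcc(frame, sample_rate)
```

Sliding the window along the recording and calling `mfcc` on each position would give one feature vector per time window.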


The Mel Scale


Perceptually motivated frequency scale for audio

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

[Figure: frequency axis marked at 0.5 kHz, 1 kHz, 2 kHz, and 3 kHz]
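The Mel mapping is easy to evaluate directly. The sketch below assumes the common form m = 2595 log10(1 + f/700) with f in Hz; the function names `hz_to_mel` and `mel_to_hz` are hypothetical, and the inverse is simply that formula solved for f.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, from mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Evaluate the scale at the frequencies marked on the slide's axis.
for f_hz in (500, 1000, 2000, 3000):
    print(f"{f_hz} Hz -> {hz_to_mel(f_hz):.0f} mel")
```

With these constants, 1 kHz maps to roughly 1000 mel, and equal steps in Hz cover fewer mel as frequency rises, which reflects the coarser pitch resolution of hearing at high frequencies.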


How do we model MFCC features?

[Diagram: MFCC Feature Vector For Time Window 1, MFCC Feature Vector For Time Window 2, ..., MFCC Feature Vector For Time Window N -> Trained Language Model -> Output Word]

How do we model MFCC features?

Trained Language Model: Neural Nets, Hidden Markov Model, Other Probabilistic Model

Hidden Markov Models

[Diagram: three states labeled 1, 2, and 3]

OUTPUT: 1 2 3 2 2 3 3 3 1 1 2 2 3 3 3 3 1
STATE: 1 2 3 2 2 3 3 3 1 1 2 2 3 3 3 3 1

Hidden Markov Models

[Diagram: three states labeled 1, 2, and 3, with output letter groups A D E H Q Z and B C F G R S]

OUTPUT: R B G F H I A B E G A H Z E Q R
STATE: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

Hidden Markov Models

Goal: Estimate State Transition based on output sequence.

Goal: Estimate Speech Syllable based on observed MFCC sequence. Train HMM with prior tagged speech data.

- i.e. estimate “Heh Loh” from sequence of MFCCs
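One standard way to recover the hidden states from an output sequence is the Viterbi algorithm, which the slides do not name; it is swapped in here as an illustration. The sketch below is a toy under loud assumptions: the two states "heh" and "loh", the discrete observation labels "a" and "b" (stand-ins for quantized MFCC vectors), and every probability are invented, whereas a real system would learn them from tagged speech data.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence (all probabilities must be > 0)."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical two-syllable model with made-up probabilities.
states = ["heh", "loh"]
start_p = {"heh": 0.9, "loh": 0.1}
trans_p = {"heh": {"heh": 0.7, "loh": 0.3}, "loh": {"heh": 0.1, "loh": 0.9}}
emit_p = {"heh": {"a": 0.8, "b": 0.2}, "loh": {"a": 0.2, "b": 0.8}}

decoded = viterbi(["a", "a", "a", "b", "b", "b"], states, start_p, trans_p, emit_p)
# decoded == ["heh", "heh", "heh", "loh", "loh", "loh"]
```

Working in log-probabilities avoids numerical underflow on long sequences, at the cost of requiring every probability in the model to be strictly positive.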



Relevant Papers:



[1] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, Deep Neural Networks for Acoustic Modeling in Speech Recognition, IEEE Signal Processing Magazine, 29, November 2012.

[2] L. R. Rabiner and R. W. Schafer, Introduction to Digital Speech Processing, Foundations and Trends in Signal Processing, Vol. 1, no. 1-2, pp. 1-194, 2007.

The State of Speech Recognition?


- Typical claims are at 90-95% word accuracy rate
- Problem 1: Noise + accents significantly affect the above rate
- Problem 2: Assuming 90% word accuracy rate, for a 10-word phrase, we have:
  - P(all correct phrase words) = 0.9^10 ≈ 35%!!
  - 95% word accuracy -> 60% phrase accuracy
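The phrase-level figures follow from multiplying the per-word accuracy once per word, which assumes that word errors are independent of each other (an assumption made explicit here, not stated on the slide). A minimal sketch, with the hypothetical helper `phrase_accuracy`:

```python
def phrase_accuracy(word_accuracy, n_words):
    """Probability that every word is right, assuming independent word errors."""
    return word_accuracy ** n_words

for word_accuracy in (0.90, 0.95):
    pct = 100 * phrase_accuracy(word_accuracy, 10)
    print(f"{word_accuracy:.0%} word accuracy -> {pct:.0f}% phrase accuracy")
```

For a 10-word phrase this prints 35% for 90% word accuracy and 60% for 95% word accuracy, matching the figures above.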


The State of Speech Recognition?



- Typical commercial systems: 40-50% real phrase accuracy
- Next generation systems (deep-neural-network based): 60-70% real phrase accuracy
- Humans: 90+% real phrase accuracy

Speech Recognition as a Mobile UI?



- Great for short conversational tasks (setting appointments, getting stock/score updates, answering emails in private, etc.)
- Not so great for writing documents, general use in public, etc.