spectacularscarecrow, AI and Robotics

Nov 17, 2013


Introduction to Automatic Speech Recognition


Define the problem

What is speech?

Feature Selection


Early methods

Modern statistical models

Current State of ASR

Future Work

The ASR Problem

There is no single ASR problem

The problem depends on many factors

Microphone: close-mic, throat-mic, microphone array, audio

Sources: band-limited, background noise,

Speaker: speaker-dependent, speaker-independent

Language: open/closed vocabulary, vocabulary size, read/spontaneous speech

Output: Transcription, speaker id, keywords

Performance Evaluation


Accuracy

Percentage of tokens correctly recognized

Error Rate

Complement of accuracy (error rate = 1 - accuracy)

Token Type
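
As a rough illustration of token-level scoring, the sketch below computes the commonly used word error rate, assuming words are the token type. It counts substitutions, deletions, and insertions with an edit distance, so it can exceed 1.0 when the hypothesis contains many extra words. The function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # dist[i][j] = minimum number of edits turning the first i reference
    # tokens into the first j hypothesis tokens (Levenshtein over words).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference tokens
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis tokens
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

The same routine works for other token types (phones, characters) if the inputs are split accordingly.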





What is Speech?

Analog signal produced by humans

You can think of the speech signal as being
decomposed into a source and a filter

The source is the vocal folds in voiced speech

The filter is the vocal tract and articulators
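
A toy sketch of the source and filter idea: an impulse train stands in for the vocal folds, and a cascade of two-pole resonators stands in for the vocal tract. The pitch, sample rate, and resonance values are arbitrary assumptions chosen for illustration, not measurements of real speech.

```python
import math


def impulse_train(f0_hz, sample_rate, n_samples):
    """Source: one unit impulse per pitch period (a crude glottal pulse)."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Filter: a two-pole resonance, y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius < 1, so stable
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


SAMPLE_RATE = 8000  # Hz, assumed
source = impulse_train(100, SAMPLE_RATE, 800)  # 100 Hz pitch, 0.1 s
# Two hypothetical resonances applied one after the other.
vowel_like = resonator(resonator(source, 700, 100, SAMPLE_RATE), 1200, 120, SAMPLE_RATE)
```

Changing the pitch changes only the source; changing the resonance frequencies changes only the filter, which is the separation the model describes.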

Speech Production

Speech Visualization

Feature Selection

As in any data-driven task, the data must be
represented in some format

Cepstral features have been found to perform well

They represent the frequency of the

Mel-frequency cepstral coefficients (MFCCs)
are the most common variety
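
The "mel" in MFCC refers to a perceptual frequency scale. The sketch below shows only that warping step, using one widely quoted formula for the mel scale; it is not a full MFCC pipeline, and the function names are chosen here for illustration.

```python
import math


def hz_to_mel(hz):
    """One common mel-scale formula: roughly linear at low Hz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(low_hz, high_hz, count):
    """Return `count` (at least 2) frequencies in Hz, evenly spaced on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (count - 1)
    return [mel_to_hz(low + k * step) for k in range(count)]


# Ten points between 0 and 4000 Hz: closely spaced at low frequencies,
# progressively further apart at high frequencies.
centers = mel_spaced_frequencies(0.0, 4000.0, 10)
```

Frequencies spaced this way are typically used to place the filters whose outputs feed the cepstral computation.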

Where do we stand?

Defined the multiple problems associated with ASR

Described how speech is produced

Illustrated how speech can be represented in
an ASR system

Now that we have the data, how do we
recognize the speech?

Radio Rex

First known attempt at speech recognition

A toy from 1922

Worked by analyzing the signal strength at

Actual speech recognition

Originally thought to be a relatively simple
task requiring a few years of concerted effort

1969: "Whither Speech Recognition?" is published by John Pierce

A DARPA project ran from 1971 to 1976 in
response to the statements in the Pierce article

We can examine a few general systems

Template-Based ASR

Originally only worked for isolated words

Performs best when training and testing
conditions are matched

For each word we want to recognize, we
store a template or example based on actual

Each test utterance is checked against the
templates to find the best match

Uses the Dynamic Time Warping (DTW) algorithm

Dynamic Time Warping

Create a similarity matrix for the two utterances

Use dynamic programming to find the lowest
cost path
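
A minimal DTW sketch along those lines, assuming each frame is a single number and using absolute difference as the local distance (so the matrix holds dissimilarities rather than similarities). Real systems would compare feature vectors; the templates below are made-up toy data.

```python
def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    """Lowest total cost of aligning sequence a to sequence b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of: advance a only, advance b only, advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the stored template with the lowest DTW cost."""
    return min(templates, key=lambda label: dtw_cost(utterance, templates[label]))


# Hypothetical one-dimensional "templates" for two words.
templates = {"yes": [1, 3, 4, 3, 1], "no": [2, 2, 1, 0]}
best = recognize([1, 1, 3, 4, 4, 3, 1], templates)  # a slower "yes"
```

Because the path may repeat frames of either sequence, a slower or faster rendition of the same word can still align with zero or low cost.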


Hearsay-II

One of the systems developed during the
DARPA program

A blackboard-based system utilizing symbolic
problem solvers

Each problem solver was called a knowledge source (KS)

A complex scheduler was used to decide
when each KS should be called


DARPA Results

The Hearsay-II system performed much
better than the two other similar competing systems

However, only one system met the
performance goals of the project

The Harpy system was also a CMU-built system

In many ways it was a predecessor to the
modern statistical systems

Modern Statistical ASR

Acoustic Model

For each frame of data, we need some way
of describing the likelihood of it belonging to
any of our classes

Two methods are commonly used

Multilayer perceptron (MLP) gives the posterior
probability of a class given the data

Gaussian Mixture Model (GMM) gives the
likelihood of the data given a class
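
To make the GMM case concrete, the sketch below evaluates the log-likelihood of one feature frame under a mixture of diagonal-covariance Gaussians. The diagonal-covariance choice, the function names, and all parameter values are assumptions made for illustration; in practice the parameters are trained from data.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | class) for a mixture: log sum_k w_k * N(x; mean_k, var_k)."""
    logs = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(value - top) for value in logs))


# Two hypothetical classes, each with a two-component mixture over 2-D frames.
class_a = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
class_b = ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]])
frame = [0.5, 0.4]
scores = {
    "a": gmm_log_likelihood(frame, *class_a),
    "b": gmm_log_likelihood(frame, *class_b),
}
```

Working in the log domain keeps the scores usable when many frames are multiplied together during decoding.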

Gaussian Distribution

Pronunciation Model

While the pronunciation model can be very
complex, it is typically just a dictionary

The dictionary contains the valid
pronunciations for each word


Cat: k ae t

Dog: d ao g

Fox: f aa k s
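
A dictionary of this kind maps naturally onto a lookup table. In the sketch below, each word maps to a list of pronunciations; the second pronunciation for "dog" is a hypothetical variant added only to show how alternatives multiply, and the function name is made up for illustration.

```python
from itertools import product

# Each word maps to one or more pronunciations (lists of phone symbols).
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"], ["d", "aa", "g"]],  # second entry is hypothetical
}


def expand(words, lexicon=LEXICON):
    """Return every phone sequence that can realize the given word sequence."""
    per_word = [lexicon[word.lower()] for word in words]
    return [
        [phone for pron in choice for phone in pron]
        for choice in product(*per_word)
    ]


phone_sequences = expand(["dog", "cat"])
```

A word missing from the dictionary raises `KeyError` here, which mirrors the closed-vocabulary behavior of a dictionary-based pronunciation model.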

Language Model

Now we need some way of representing the
likelihood of any given word sequence

Many methods exist, but n-grams are the
most common

N-gram models are trained by simply
counting the occurrences of words in a
training set


A unigram is the probability of any word in

A bigram is the probability of a given word
given the previous word

Higher-order n-grams continue in a similar fashion

A backoff probability is used for any unseen n-grams
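
The counting idea can be sketched in a few lines. The backoff below is deliberately simplified: an unseen bigram falls back to a scaled unigram estimate with a fixed weight, which does not yield properly normalized probabilities the way standard discounting schemes do. The weight of 0.4 and the training sentences are arbitrary assumptions.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, with sentence start/end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_score(prev, word, unigrams, bigrams, backoff_weight=0.4):
    """Relative frequency of (prev, word), or a scaled unigram score if unseen."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return backoff_weight * unigrams[word] / total


unigrams, bigrams = train_bigram(["the cat sat", "the dog sat"])
seen = bigram_score("the", "cat", unigrams, bigrams)    # counted in training
unseen = bigram_score("cat", "dog", unigrams, bigrams)  # falls back to the unigram
```

Without some form of backoff, any word sequence containing a single unseen bigram would receive a score of zero.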

How do we put it together?

We now have models to represent the three
parts of our equation

We need a framework to join these models

The standard framework used is the Hidden
Markov Model (HMM)
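
The equation referred to above is not preserved in this transcript. A common way to write the decoding objective that the three models fit into is sketched below; the symbols X (acoustic feature frames), Q (phone sequence), and W (word sequence) are notation assumed here rather than taken from the slides.

```latex
\begin{aligned}
\hat{W} &= \arg\max_{W} P(W \mid X) \\
        &= \arg\max_{W} \; p(X \mid W)\, P(W) \\
        &= \arg\max_{W} \; \sum_{Q} p(X \mid Q)\, P(Q \mid W)\, P(W)
\end{aligned}
```

In this form p(X | Q) plays the role of the acoustic model, P(Q | W) the pronunciation model, and P(W) the language model. The last line assumes the acoustics depend on the words only through the phone sequence.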

Markov Model

A state model using the Markov property

The Markov property states that the future
depends only on the present state

Models the likelihood of transitions between
states in a model

Given the model, we can determine the
likelihood of any sequence of states
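
As a small illustration, the likelihood of a state sequence is the initial probability of the first state multiplied by each transition probability along the way. The two-state weather model and its numbers are invented for this sketch.

```python
import math


def sequence_log_prob(states, initial, transitions):
    """Log probability of a state sequence under a first-order Markov model."""
    probs = [initial[states[0]]]
    probs += [transitions[prev][cur] for prev, cur in zip(states, states[1:])]
    if any(p == 0.0 for p in probs):
        return float("-inf")  # the sequence is impossible under the model
    return sum(math.log(p) for p in probs)


# Hypothetical model: tomorrow's weather depends only on today's.
initial = {"sunny": 0.6, "rainy": 0.4}
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
log_p = sequence_log_prob(["sunny", "sunny", "rainy"], initial, transitions)
probability = math.exp(log_p)  # 0.6 * 0.8 * 0.2
```

Summing logs instead of multiplying probabilities avoids underflow on long sequences.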

Hidden Markov Model

Similar to a Markov model except the states
are hidden

We now have observations tied to the
individual states

We no longer know the exact state sequence
given the data

Allows for the modeling of an underlying
unobservable process
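
Because the state sequence is unknown, the likelihood of a series of observations has to account for every state sequence that could have produced it. The sketch below does this with the forward algorithm on a toy HMM with discrete observations; the states, observation symbols, and probabilities are all invented for illustration.

```python
def forward_likelihood(observations, states, initial, transitions, emissions):
    """P(observations), summed over all hidden state sequences."""
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: initial[s] * emissions[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emissions[s][obs] * sum(alpha[p] * transitions[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical model: hidden weather, observed umbrella use.
STATES = ["sunny", "rainy"]
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMISSIONS = {
    "sunny": {"umbrella": 0.1, "none": 0.9},
    "rainy": {"umbrella": 0.8, "none": 0.2},
}
likelihood = forward_likelihood(
    ["umbrella", "umbrella"], STATES, INITIAL, TRANSITIONS, EMISSIONS
)
```

In an ASR system the discrete emission table would be replaced by the acoustic model's score for each frame.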

HMMs for ASR

First we build an HMM for each phone

Next we combine the phone models based
on the pronunciation model to create word
level models

Finally, the word level models are combined
based on the language model

We now have a giant network with potentially
thousands or even millions of states
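
A sketch of the first two steps, assuming each phone is modeled as a left-to-right chain of three states (a common choice, not something the slides specify). State names, the tiny lexicon, and the function name are illustrative only; transition probabilities and the language-model links between words are omitted.

```python
def word_state_chain(word, lexicon, states_per_phone=3):
    """Concatenate per-phone state chains into one word-level chain."""
    return [
        f"{phone}_{k}"
        for phone in lexicon[word]
        for k in range(1, states_per_phone + 1)
    ]


lexicon = {"cat": ["k", "ae", "t"], "dog": ["d", "ao", "g"]}
cat_states = word_state_chain("cat", lexicon)
# Total states needed for this tiny vocabulary, before adding word-to-word links.
vocabulary_states = sum(len(word_state_chain(w, lexicon)) for w in lexicon)
```

Even at three states per phone, a vocabulary of tens of thousands of words quickly reaches the network sizes mentioned above.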


Decoding happens in the same way as the
previous example

For each time frame we need to maintain two
pieces of information

The likelihood of being at any state

The previous state for every state
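
These two pieces of information are what the Viterbi algorithm keeps per frame, which is the usual way to find the single best state sequence; the slides do not name the algorithm, so treat that as an assumption. The toy HMM below uses invented states, observations, and probabilities.

```python
def viterbi(observations, states, initial, transitions, emissions):
    """Return (best state sequence, its joint probability with the observations)."""
    # score[s] = probability of the best path so far that ends in state s.
    score = {s: initial[s] * emissions[s][observations[0]] for s in states}
    backpointers = []  # one dict per later frame: state -> best previous state
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] * transitions[p][s])
            new_score[s] = score[best_prev] * transitions[best_prev][s] * emissions[s][obs]
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state using the stored previous states.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical model: hidden weather, observed umbrella use.
STATES = ["sunny", "rainy"]
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMISSIONS = {
    "sunny": {"umbrella": 0.1, "none": 0.9},
    "rainy": {"umbrella": 0.8, "none": 0.2},
}
best_path, best_prob = viterbi(
    ["none", "none", "umbrella"], STATES, INITIAL, TRANSITIONS, EMISSIONS
)
```

Practical decoders work with log probabilities and prune unlikely states at each frame, since the full network is far too large to score exhaustively.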

State of the Art

What works well

Constrained vocabulary systems

Systems adapted to a given speaker

Systems in anechoic environments without
background noise

Systems expecting read speech

What doesn't work

Large unconstrained vocabulary

Noisy environments

Conversational speech

Future Work

Better representations of audio based on

Better representation of acoustic elements
based on articulatory phonology

Segmental models that do not rely on the
simple frame-based approach


Hidden Markov Model Toolkit (HTK)

CHiME (a freely available dataset)

Machine Learning Lectures