Nov 17, 2013

Introduction to Automatic
Speech Recognition

Outline


Define the problem


What is speech?


Feature Selection


Models


Early methods


Modern statistical models


Current State of ASR


Future Work

The ASR Problem


There is no single ASR problem


The problem depends on many factors


Microphone: close-mic, throat-mic, microphone array, audio-visual


Sources: band-limited, background noise, reverberation


Speaker: speaker dependent, speaker
independent


Language: open/closed vocabulary, vocabulary
size, read/spontaneous speech


Output: Transcription, speaker id, keywords

Performance Evaluation


Accuracy


Percentage of tokens correctly recognized


Error Rate


Complement of accuracy (error rate = 1 - accuracy)


Token Type


Phones


Words*


Sentences


Semantics?
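
For word tokens, the error rate is usually computed from the edit distance between the reference transcript and the recognizer output. A minimal sketch, assuming whitespace-separated words; the function names are illustrative and not taken from any toolkit:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, this rate can exceed 100% when the hypothesis is much longer than the reference.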

What is Speech?


Analog signal produced by humans


The speech signal can be thought of as
decomposed into a source and a filter


The source is the vocal folds in voiced speech


The filter is the vocal tract and articulators
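
The source-filter view can be sketched in a few lines: an impulse train stands in for the vocal-fold pulses, and a single two-pole resonator stands in for one vocal-tract resonance (formant). The pitch, formant frequency, and bandwidth below are illustrative assumptions, not values from the slides, and a real vocal tract has several resonances.

```python
import math


def impulse_train(f0_hz, sample_rate, n_samples):
    """Source: one unit pulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Filter: a two-pole resonator, y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1 so stable
    theta = 2 * math.pi * center_hz / sample_rate        # pole angle
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Assumed example values: 100 Hz pitch, one formant near 700 Hz, 8 kHz sampling.
source = impulse_train(100, 8000, 800)
speech_like = resonator(source, 700, 100, 8000)
```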

Speech Production

Speech Visualization

Feature Selection


As in any data-driven task, the data must be
represented in some format


Cepstral features have been found to perform
well


They describe the spectrum of the log spectrum
(informally, the "frequency of the frequencies")


Mel-frequency cepstral coefficients (MFCC)
are the most common variety
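
MFCCs are typically obtained by taking a discrete cosine transform of log energies from filters spaced evenly on the mel scale. The sketch below shows only those two ingredients, using one common variant of the mel formula (2595 and 700 as constants); the function names are illustrative, and framing, windowing, and the filterbank itself are omitted.

```python
import math


def hz_to_mel(f_hz):
    """One widely used mel-scale formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(low_hz, high_hz, n_filters):
    """Filter centers equally spaced in mel, returned in Hz."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II; applied to log filterbank energies it yields cepstral coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]
```

Equal spacing in mel means the filters are narrow at low frequencies and wide at high frequencies.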

Where do we stand?


Defined the multiple problems associated with
ASR


Described how speech is produced


Illustrated how speech can be represented in
an ASR system


Now that we have the data, how do we
recognize the speech?

Radio Rex


First known attempt at speech recognition


A toy from 1922


Worked by analyzing the signal strength at
500Hz
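
A software analogue of that idea is to measure the signal power at a single frequency and compare it with a threshold. The sketch below uses the Goertzel algorithm for the measurement; this is a swapped-in digital technique for illustration, not a description of how the toy itself was built, and the sampling rate in the usage note is an assumption.

```python
import math


def goertzel_power(samples, target_hz, sample_rate):
    """Power of `samples` at `target_hz`, computed with the Goertzel recurrence."""
    w = 2 * math.pi * target_hz / sample_rate
    coeff = 2 * math.cos(w)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def triggers(samples, sample_rate, threshold, target_hz=500.0):
    """True when the power at the target frequency exceeds the threshold."""
    return goertzel_power(samples, target_hz, sample_rate) > threshold
```

With an assumed 8 kHz sampling rate, a 500 Hz tone produces a large value and a 1500 Hz tone a value near zero, so a detector like this responds to any sound with enough energy near 500 Hz, whatever was actually said.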

Actual speech recognition
systems


Originally thought to be a relatively simple
task requiring a few years of concerted effort


1969: J. R. Pierce's “Whither Speech Recognition?” is
published


A DARPA project ran from 1971-1976 in
response to the statements in the Pierce
article


We can examine a few general systems

Template-Based ASR


Originally only worked for isolated words


Performs best when training and testing
conditions are closely matched


For each word we want to recognize, we
store a template or example based on actual
data


Each test utterance is checked against the
templates to find the best match


Uses the Dynamic Time Warping (DTW)
algorithm

Dynamic Time Warping


Create a distance matrix between the frames
of the two utterances


Use dynamic programming to find the lowest
cost path
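
The two steps above, combined with the template lookup from the previous slide, can be sketched as follows. This is a minimal version that assumes each frame is a single number and uses absolute difference as the frame distance; real systems compare feature vectors, and the names here are illustrative.

```python
def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the cheapest monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b frame repeats
                                 cost[i][j - 1],      # b advances, a frame repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(utterance, templates):
    """Return the word whose stored template aligns most cheaply with the utterance."""
    return min(templates, key=lambda word: dtw_cost(utterance, templates[word]))
```

For example, with the hypothetical templates `{"yes": [1, 2, 3], "no": [3, 2, 1]}`, the slower utterance `[1, 1, 2, 3, 3]` still matches "yes" because the warping absorbs the repeated frames.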

Hearsay-II


One of the systems developed during the
DARPA program


A blackboard-based system utilizing symbolic
problem solvers


Each problem solver was called a knowledge
source (KS)


A complex scheduler was used to decide
when each KS should be called

DARPA Results


The Hearsay-II system performed much
better than the two other similar competing
systems


However, only one system, Harpy, met the
performance goals of the project


Harpy was also built at CMU


In many ways it was a predecessor to the
modern statistical systems

Modern Statistical ASR

Acoustic Model


For each frame of data, we need some way
of describing the likelihood of it belonging to
any of our classes


Two methods are commonly used


Multilayer perceptron (MLP) gives the posterior
probability of a class given the data


Gaussian Mixture Model (GMM) gives the
likelihood of the data given a class

Gaussian Distribution
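
A Gaussian mixture is a weighted sum of Gaussian densities. The sketch below scores a single one-dimensional frame under each class GMM and then applies Bayes' rule to turn those likelihoods into class posteriors, the quantity an MLP would output directly. The class labels and all parameters are made up for illustration; real acoustic models use multivariate Gaussians over feature vectors.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def gmm_likelihood(x, components):
    """p(x | class) for a mixture given as (weight, mean, variance) triples."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def class_posteriors(x, class_gmms, class_priors):
    """P(class | x) via Bayes' rule from per-class likelihoods and priors."""
    joint = {c: gmm_likelihood(x, class_gmms[c]) * class_priors[c] for c in class_gmms}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}


# Hypothetical two-class example with invented parameters.
CLASS_GMMS = {
    "ae": [(0.6, 1.0, 0.5), (0.4, 2.0, 1.0)],
    "s": [(1.0, -2.0, 1.0)],
}
CLASS_PRIORS = {"ae": 0.5, "s": 0.5}
```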

Pronunciation Model


While the pronunciation model can be very
complex, it is typically just a dictionary


The dictionary contains the valid
pronunciations for each word


Examples:


Cat: k ae t


Dog: d ao g


Fox: f aa k s
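
A dictionary-style pronunciation model maps each word to one or more phone sequences. A minimal sketch, reusing the cat and dog entries above; the entry for "the" is an added assumption to show a word with two pronunciations, and the function names are illustrative.

```python
# Each word maps to a list of pronunciations; each pronunciation is a list of phones.
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"]],
    "the": [["dh", "ah"], ["dh", "iy"]],  # assumed entry with two variants
}


def pronunciations(word):
    """All valid pronunciations of a word, or an empty list if it is out of vocabulary."""
    return LEXICON.get(word.lower(), [])


def words_to_phones(words):
    """Expand a word sequence into phones, taking the first listed pronunciation."""
    phones = []
    for word in words:
        variants = pronunciations(word)
        if not variants:
            raise KeyError(f"out-of-vocabulary word: {word}")
        phones.extend(variants[0])
    return phones
```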

Language Model


Now we need some way of representing the
likelihood of any given word sequence


Many methods exist, but n-grams are the
most common


N-gram models are trained by simply
counting the occurrences of words in a
training set

N-grams


A unigram model gives the probability of a word in
isolation


A bigram model gives the probability of a word
given the previous word


Higher order n-grams continue in a similar
fashion


A backoff probability is used for any unseen
data
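
Training by counting, with a fallback for unseen pairs, can be sketched as below. The fallback here is a simple fixed-penalty backoff to the unigram estimate (in the style sometimes called "stupid backoff"), chosen for brevity; it yields scores rather than properly normalized probabilities, and the penalty value 0.4 is an assumption.

```python
from collections import Counter


def train(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_score(prev, word, unigrams, bigrams, alpha=0.4):
    """Relative frequency of (prev, word); back off to a scaled unigram estimate if unseen."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total
```

Trained on "the cat sat" and "the dog sat", the pair (the, cat) scores 0.5 from its counts, while the unseen pair (cat, dog) falls back to 0.4 times the unigram estimate of "dog".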

How do we put it together?


We now have models to represent the three
parts of our equation


We need a framework to join these models
together


The standard framework used is the Hidden
Markov Model (HMM)
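
The equation itself is not preserved in the text of these slides. Assuming it is the standard statistical formulation, with X the sequence of acoustic frames, W a word sequence, and Q a phone (state) sequence, the three parts would be:

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} P(X \mid W)\, P(W),
\qquad
P(X \mid W) = \sum_{Q} P(X \mid Q)\, P(Q \mid W)
```

Under that assumption, P(X | Q) is the acoustic model, P(Q | W) is the pronunciation model, and P(W) is the language model. The symbols X, W, and Q are notation introduced here.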


Markov Model


A state model using the Markov property


The Markov property states that the future
depends only on the present state


Models the likelihood of transitions between
states in a model


Given the model, we can determine the
likelihood of any sequence of states
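
Under the Markov property, the likelihood of a state sequence is the initial probability of the first state times the transition probability of each step. A minimal sketch with a hypothetical two-state weather model; the states and numbers are invented for illustration.

```python
def sequence_probability(states, initial, transition):
    """P(s1, ..., sT) = P(s1) * product of P(s_t | s_{t-1})."""
    if not states:
        raise ValueError("state sequence must not be empty")
    p = initial[states[0]]
    for prev, curr in zip(states, states[1:]):
        p *= transition[prev][curr]
    return p


# Hypothetical model parameters.
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}
```

For the sequence sunny, sunny, rainy this gives 0.6 × 0.8 × 0.2 = 0.096.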

Hidden Markov Model


Similar to a Markov model except the states
are hidden


We now have observations tied to the
individual states


We no longer know the exact state sequence
given the data


Allows for the modeling of an underlying
unobservable process
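
Because the state sequence is unknown, the likelihood of an observation sequence has to account for every state sequence that could have produced it. The forward algorithm does this sum efficiently. A minimal sketch with a hypothetical two-state model over the symbols "x" and "y"; all names and numbers are illustrative.

```python
def forward_likelihood(observations, states, initial, transition, emission):
    """P(observations) summed over all hidden state sequences."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: initial[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emission[s][obs] * sum(alpha[p] * transition[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical model parameters.
STATES = ("A", "B")
INITIAL = {"A": 0.9, "B": 0.1}
TRANSITION = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMISSION = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
```

In practice the products are computed with log probabilities or scaling to avoid numerical underflow on long utterances.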

HMMs for ASR


First we build an HMM for each phone


Next we combine the phone models based
on the pronunciation model to create word
level models


Finally, the word level models are combined
based on the language model


We now have a giant network with potentially
thousands or even millions of states
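
The first two steps can be sketched as chaining per-phone state lists into one left-to-right word model. The choice of three states per phone and the 0.5 self-loop probability are assumptions for illustration, as are the state naming scheme and the "<exit>" marker; combining word models through the language model is not shown.

```python
STATES_PER_PHONE = 3  # assumed; a common choice, not stated in the slides


def word_states(pronunciation):
    """State names for a word, formed by concatenating its phone models in order."""
    return [
        f"{position}:{phone}_{i}"
        for position, phone in enumerate(pronunciation)
        for i in range(STATES_PER_PHONE)
    ]


def left_to_right_transitions(states, self_loop=0.5):
    """Each state either repeats or moves on to the next; the last one exits the word."""
    transitions = {}
    for i, state in enumerate(states):
        following = states[i + 1] if i + 1 < len(states) else "<exit>"
        transitions[state] = {state: self_loop, following: 1.0 - self_loop}
    return transitions


cat_states = word_states(["k", "ae", "t"])
cat_transitions = left_to_right_transitions(cat_states)
```

Even this three-phone word yields nine states, which is why a full vocabulary combined with a language model produces networks of the size described above.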

Decoding


Decoding happens in the same way as the
previous example


For each time frame we need to maintain two
pieces of information


The likelihood of being at any state


The previous state for every state
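
Those two pieces of information are exactly what the Viterbi algorithm keeps: the score of the best path into each state and a backpointer to the state it came from. A minimal sketch with a hypothetical two-state model; the names and numbers are illustrative, and real decoders work with log probabilities and prune unlikely states.

```python
def viterbi(observations, states, initial, transition, emission):
    """Return the most likely state sequence and its probability."""
    # delta[s] = probability of the best path ending in state s at the current frame
    delta = {s: initial[s] * emission[s][observations[0]] for s in states}
    backpointers = []  # one dict per frame after the first: state -> best previous state
    for obs in observations[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: delta[p] * transition[p][s])
            new_delta[s] = delta[best_prev] * transition[best_prev][s] * emission[s][obs]
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical model parameters.
STATES = ("A", "B")
INITIAL = {"A": 0.9, "B": 0.1}
TRANSITION = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMISSION = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
```

For the observations x, x, y, y under these parameters, the best path stays in A for the first two frames and then moves to B.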

State of the Art


What works well


Constrained vocabulary systems


Systems adapted to a given speaker


Systems in anechoic environments without
background noise


Systems expecting read speech


What doesn't work


Large unconstrained vocabulary


Noisy environments


Conversational speech

Future Work


Better representations of audio based on
humans


Better representation of acoustic elements
based on articulatory phonology


Segmental models that do not rely on the
simple frame-based approach

Resources


Hidden Markov Model Toolkit (HTK)



http://htk.eng.cam.ac.uk/


CHiME (a freely available dataset)



http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html


Machine Learning Lectures


http://www.stanford.edu/class/cs229/


http://www.youtube.com/watch?v=UzxYlbK2c7E