
Nov 17, 2013


SPEECH RECOGNITION
FOR MOBILE SYSTEMS

BY:

PRATIBHA CHANNAMSETTY

SHRUTHI SAMBASIVAN


Introduction


What is speech recognition?



Automatic speech recognition (ASR) is the process by which a computer
maps an acoustic speech signal to text.





CLASSIFICATION OF SPEECH RECOGNITION SYSTEM



Users

- Speaker-dependent system
- Speaker-independent system
- Speaker-adaptive system

Vocabulary

- Small vocabulary: tens of words
- Medium vocabulary: hundreds of words
- Large vocabulary: thousands of words
- Very-large vocabulary: tens of thousands of words



Word Pattern

- Isolated-word system: single words at a time
- Continuous speech system: words are connected together







HOW SPEECH RECOGNITION WORKS

APPLICATIONS


Healthcare



Military



Helicopters



Training air traffic controllers



Telephony and other domains


WHY SPEECH RECOGNITION?


Speech is the easiest and most common way for people to
communicate.



Speech is also faster than typing on a keypad and more expressive than
clicking on a menu item.



Speech makes devices accessible to users with low literacy.



Cellphones have proliferated widely across the market.




CHALLENGES ON MOBILE DEVICES


Limited available storage space


Cheap and variable microphones


No hardware support for floating point arithmetic


Low processor clock frequency


Small cache of 8-32 KB


Highly variable and challenging acoustic environments, ranging from heavy
background traffic noise to small reverberant rooms with multiple
people speaking simultaneously


High energy consumption during algorithm execution
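The lack of hardware floating point support is commonly worked around with fixed-point arithmetic. The following is a minimal sketch, assuming the Q15 format (15 fractional bits); the function names are hypothetical and chosen only for illustration.

```python
Q = 15            # number of fractional bits (Q15 format, an assumption)
ONE = 1 << Q      # integer that represents 1.0


def saturate(value: int) -> int:
    """Clamp a value to the representable Q15 range."""
    return max(-ONE, min(ONE - 1, value))


def to_q15(x: float) -> int:
    """Convert a float in [-1, 1) to a Q15 integer."""
    return saturate(int(round(x * ONE)))


def from_q15(q: int) -> float:
    """Convert a Q15 integer back to a float (for inspection only)."""
    return q / ONE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values using only integer operations."""
    product = a * b                  # Q30 intermediate result
    product += 1 << (Q - 1)          # add half an LSB to round to nearest
    return saturate(product >> Q)    # shift back to Q15


def q15_dot(xs, ys) -> int:
    """Dot product with a wide accumulator, as used in filter computations."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y                 # accumulate in Q30
    return saturate((acc + (1 << (Q - 1))) >> Q)


if __name__ == "__main__":
    a, b = to_q15(0.5), to_q15(0.25)
    print(from_q15(q15_mul(a, b)))
```

On a real device the same operations would be written with integer types in C; Python is used here only to show the arithmetic.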


ASR MODELS



Embedded speech recognition




Speech recognition in the cloud



Distributed speech recognition



Shared speech recognition with user-based adaptation (proposed model of use)





EMBEDDED MOBILE SPEECH RECOGNITION


Advantages


Does not rely on any communication with a central server



Cost effective



Not affected by network latency



Disadvantages




Cannot perform complex computations



Limited in speed and memory



To achieve reliable performance, every sub-system of the ASR must be
modified to take both factors into account.






SPEECH RECOGNITION IN THE CLOUD



Advantages



Improves speed and accuracy



It provides an easy way to upgrade or modify the central speech
recognition system.



It can be used for speech recognition with low-end mobile devices such as
cheap cellphones.









Disadvantages


Performance degradation



Acoustic models on the central server need to account for large
variations in the different channels.



Each data transfer over the telephone network can cost money for the
end user.




DISTRIBUTED SPEECH RECOGNITION


Advantages



Does not require high-quality speech



Improves word error rates
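Word error rate is conventionally computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is illustrative only; the function name and the example sentences in the usage note are assumptions, not data from this presentation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("call my office now", "call the office")` counts one substitution and one deletion against four reference words.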




Disadvantages


The major disadvantage of this mode remains cost and the need for a
continuous and reliable cellular connection.


There is a need for standardized feature extraction processes that account
for variability arising from differences in channel, multilinguality,
accents, gender, etc.




SHARED SPEECH RECOGNITION WITH USER
BASED ADAPTATION


Advantages



The ability to function even without network connectivity.



Works well for the limited set of conditions it encounters.



These conditions can be covered successfully by existing mobile devices,
if trained or adapted accordingly.



Server capacity has to be provided only for average, not peak use.


Speech recognition Process in detail

Front-end Process

Involves spectral analysis that derives feature vectors to capture salient
spectral characteristics of the speech input.
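A front end of this kind can be sketched in a few lines. The example below is a simplified illustration, not the feature extraction used by any particular recognizer: it assumes 64-sample frames, a Hamming window, a naive DFT, and log energies in four equal-width bands, whereas production systems typically use an FFT and mel-scaled filter banks. All function names are hypothetical.

```python
import cmath
import math


def frames(signal, frame_len, hop):
    """Split the signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def hamming(n):
    """Hamming window coefficients for a frame of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
        spectrum.append(abs(s) ** 2 / n)
    return spectrum


def feature_vector(frame, n_bands=4):
    """Log energy in n_bands equal-width frequency bands."""
    window = hamming(len(frame))
    spec = power_spectrum([x * w for x, w in zip(frame, window)])
    width = len(spec) // n_bands
    feats = []
    for b in range(n_bands):
        # The last band absorbs any leftover bins.
        end = len(spec) if b == n_bands - 1 else (b + 1) * width
        energy = sum(spec[b * width:end])
        feats.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return feats


def extract_features(signal, frame_len=64, hop=32, n_bands=4):
    """Return one feature vector per frame of the input signal."""
    return [feature_vector(f, n_bands) for f in frames(signal, frame_len, hop)]
```

Each frame of the waveform becomes one short vector; the back end then works on the sequence of vectors rather than on raw samples.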


Backend Process

Combines word-level matching and sentence-level search to perform an
inverse operation to decode the message from the speech waveform.

Acoustic model


Provides a method of calculating the likelihood of any feature vector
sequence Y given a word W.


Each phone is represented by an HMM.
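For an HMM, the likelihood of an observation sequence is usually computed with the forward algorithm. The sketch below assumes a discrete-observation HMM for brevity (real acoustic models score continuous feature vectors); the two-state model in the usage note is a made-up toy, not a trained phone model.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Return P(Y | model) for a non-empty sequence of discrete symbols.

    initial[i]       : probability of starting in state i
    transition[i][j] : probability of moving from state i to state j
    emission[i][o]   : probability that state i emits symbol o
    """
    n_states = len(initial)

    # alpha[i] = probability of the observations so far, ending in state i
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]

    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states))
            * emission[j][obs]
            for j in range(n_states)
        ]

    # Sum over all states the sequence could end in.
    return sum(alpha)
```

With a hypothetical left-to-right model `initial = [1.0, 0.0]`, `transition = [[0.6, 0.4], [0.0, 1.0]]` and `emission = [[0.9, 0.1], [0.2, 0.8]]`, calling `forward_likelihood([0, 1], initial, transition, emission)` sums the probabilities of both state paths that can produce the sequence.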


Language Model


The purpose of the language model is to take advantage of linguistic
constraints to compute the probability of different word sequences.


Assuming a sequence of $K$ words, $W = \{w_1, w_2, \ldots, w_K\}$,
the probability $P(W)$ can be expanded as


$P(W) = P(w_1, w_2, \ldots, w_K) = \prod_{k=1}^{K} P(w_k \mid w_1, \ldots, w_{k-1})$


We generally make the simplifying assumption that any word $w_k$ depends
only on the previous $N-1$ words in the sequence.


This is known as an N-gram model.
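As an illustration of the N-gram idea, the sketch below estimates a bigram model (N = 2) from a tiny corpus and scores a word sequence. Add-one smoothing, the sentence boundary markers `<s>` and `</s>`, and the function names are assumptions made for this example.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])                 # counts of each history word
        bigrams.update(zip(words[:-1], words[1:]))  # counts of adjacent pairs
    return unigrams, bigrams


def bigram_probability(sentence, unigrams, bigrams, vocab_size):
    """Return P(W) as a product of add-one smoothed bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        prob *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob
```

Here `vocab_size` is the number of distinct tokens that can follow a history, including `</s>`. A sequence whose word pairs were seen in training receives a higher probability than the same words in an unseen order.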


Grammars


Uses context-free grammars represented by finite state
automata (FSA).
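A finite state automaton for a small command grammar can be represented as a transition table keyed by (state, word). The grammar below is hypothetical and exists only to show the mechanism.

```python
def accepts(words, transitions, start, finals):
    """Return True if the word sequence drives the automaton to a final state."""
    state = start
    for word in words:
        state = transitions.get((state, word))
        if state is None:      # no arc for this word: sequence is rejected
            return False
    return state in finals


# Hypothetical grammar: ("call" | "dial") ("home" | "office"), or "play" "music".
TRANSITIONS = {
    (0, "call"): 1,
    (0, "dial"): 1,
    (1, "home"): 3,
    (1, "office"): 3,
    (0, "play"): 2,
    (2, "music"): 3,
}
START_STATE = 0
FINAL_STATES = {3}
```

A recognizer constrained by such a grammar only needs to consider the words that have an outgoing arc from the current state, which keeps the search space small.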




Overview of Statistical Speech recognition

Statistical Speech recognition model


A word sequence is postulated and the language model computes its
probability.


Each word is converted into sounds or phones using a pronunciation
dictionary.


Each phoneme has a corresponding statistical Hidden Markov Model
(HMM).


The HMMs of the phonemes are concatenated to form a word model and the
likelihood of the data given the word sequence is computed.


This process is repeated for many word sequences and the best is chosen as
the output.
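The selection step can be sketched as picking the candidate that maximizes the sum of the acoustic log-likelihood and the language model log-probability. The scores in the usage note are invented for illustration, and `lm_weight` is a hypothetical tuning parameter, not something specified in these slides.

```python
import math


def best_hypothesis(candidates, acoustic_loglik, lm_prob, lm_weight=1.0):
    """Return the candidate maximizing log P(Y|W) + lm_weight * log P(W).

    acoustic_loglik(w) : log-likelihood of the audio given word sequence w
    lm_prob(w)         : language model probability of w (must be > 0)
    """
    def score(w):
        return acoustic_loglik(w) + lm_weight * math.log(lm_prob(w))

    return max(candidates, key=score)
```

For example, with hypothetical acoustic scores `{"recognize speech": -10.0, "wreck a nice beach": -9.5}` and language model probabilities `{"recognize speech": 1e-3, "wreck a nice beach": 1e-6}`, passing the dictionaries' `get` methods as the two scoring functions selects "recognize speech" even though the other candidate fits the audio slightly better.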



Speech recognition on embedded platforms


Embedded ASR can be deployed either locally or in a distributed
environment with both advantages and disadvantages.



For large vocabulary continuous speech recognition (LVCSR), embedded
devices are limited in terms of CPU power and amount of memory.



Most importantly, speed is a limiting factor.

Decoding algorithm


Asynchronous stack-based decoder: memory efficient but complex.

Viterbi-based decoder: most efficient.
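The core of a Viterbi-based decoder is a dynamic program that keeps, for every state, only the best path reaching it. The sketch below assumes a discrete-observation HMM and plain probabilities; practical decoders work in the log domain and prune unlikely paths. The toy model in the usage note is hypothetical.

```python
def viterbi(observations, initial, transition, emission):
    """Return (best state path, its probability) for a non-empty sequence."""
    n = len(initial)

    # delta[i] = probability of the best path so far that ends in state i
    delta = [initial[i] * emission[i][observations[0]] for i in range(n)]
    backpointers = []

    for obs in observations[1:]:
        new_delta, pointers = [], []
        for j in range(n):
            # Best predecessor of state j at this time step.
            best_i = max(range(n), key=lambda i: delta[i] * transition[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] * transition[best_i][j] * emission[j][obs])
        delta = new_delta
        backpointers.append(pointers)

    # Trace back from the most probable final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, max(delta)
```

With `initial = [1.0, 0.0]`, `transition = [[0.6, 0.4], [0.0, 1.0]]` and `emission = [[0.9, 0.1], [0.2, 0.8]]`, decoding the sequence `[0, 0, 1, 1]` yields a path that stays in the first state for two frames and then moves to the second.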

Three types of search implementation

Combination of static graph and static search space

Static graph space with dynamic search space

Dynamic graph


Mobile speech frameworks


Nuance Dragon Mobile SDK


OpenEars


Sphinx


CeedVocal SDK


Vlingo



Dragon Mobile SDK


The Dragon Mobile SDK provides speech recognition and text-to-speech
functionality.

The Speech Kit framework provides the classes necessary to perform
network-based speech recognition and text-to-speech synthesis.

It uses the SystemConfiguration and AudioToolbox frameworks.

Speech kit architecture

OpenEars


OpenEars is an iOS framework for iPhone voice recognition and speech
synthesis (TTS).

It uses the open source CMU Pocketsphinx, CMU Flite, and CMUCLMTK
libraries.


OpenEars performs recognition on the iPhone itself, without using the
network.

Sphinx


CMU Sphinx is an open source toolkit for speech recognition developed
by Carnegie Mellon University.

CMU Sphinx is a speaker-independent large vocabulary continuous
speech recognizer.


Pocketsphinx: lightweight recognizer library written in C.

Sphinx4: adjustable, modifiable recognizer written in Java.



CeedVocal SDK


CeedVocal SDK is an isolated-word speech recognition SDK for iOS.


It operates locally on the device and supports 6 languages: English, French,
German, Dutch, Spanish and Italian.

Mobile applications using speech recognition


Google Now


Siri


S-Voice


Dragon Search


Dragon Dictation


Trippo-Mondo


Verbally



