Nov 17, 2013


Voice Recognition

Lawrence Pan

Syen Hassan

Jamme Tan


Overview


History of voice recognition


Why voice recognition?


Technology behind voice recognition


Five major steps


Common applications


Current leaders


Demonstrations


Product Evaluation


Implementation of our own voice recognition system


Grade retrieval system for EE3414


Future Challenges


History of Voice Recognition


Radio Rex (toy dog that sprang out of its house when its name was called), 1922


U.S. Department of Defense, 1940s


Speech Understanding Research (SUR) program


Carnegie Mellon University & MIT


Automatic interception & translation of Russian radio transmissions (FAILURE)


Original message: “the spirit is willing but the flesh is weak”


Translated message: “the vodka is strong but the meat is disgusting.”


History Cont’d


First major achievements


Bell Laboratories, 1952


Successful recognition of numbers 0 to 9, spoken over the telephone


MIT, 1959


Successful recognition of vowels with 93% accuracy


Carnegie Mellon University, 1970s


HARPY system: capable of recognizing complete sentences


History Cont’d


Obstacles


Computing power: over 50 computers needed for the HARPY system to run


Ability to recognize speech from any person


Taking into account different accents, speech tones, etc.


Ability to recognize continuous speech


so…we…do…not…have…to…speak…like…this!


Commercialization of voice recognition systems


History Cont’d

[Figure: Computation required vs. computation available in processors over time]

[Figure: Accuracy and task complexity progress over time]

Why Voice Recognition?


Convenience


Natural user interface: human speech


Improved services for the disabled


Wider range of users


Future possibilities and improvements


Internet use over phones through voice portals


Advanced applications implementing voice control in all areas




Technology behind Voice Recognition


Five major steps used by a speech recognizer


Five major steps in voice recognition


Capture and Digitization


The system interacts with the telephony device to capture voice input at 8000 samples/sec
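To make the capture step concrete, here is a minimal sketch of sampling and quantizing a waveform at 8000 samples/sec. It assumes plain 16-bit linear PCM; many telephone systems actually use 8-bit companded (G.711) encoding instead. The `tone` function is a hypothetical stand-in for the analog voice signal.

```python
import math

SAMPLE_RATE = 8000  # telephony-band sampling rate from the slide (samples/sec)

def digitize(signal, duration_s, sample_rate=SAMPLE_RATE, bits=16):
    """Sample a continuous-time function and quantize it to signed integers.

    `signal` is any function of time (seconds) returning values in [-1.0, 1.0];
    16-bit linear quantization is an assumption, not taken from the slides.
    """
    n_samples = int(duration_s * sample_rate)
    max_int = 2 ** (bits - 1) - 1  # 32767 for 16-bit audio
    samples = []
    for n in range(n_samples):
        t = n / sample_rate  # time of the n-th sample
        value = max(-1.0, min(1.0, signal(t)))  # clip to the valid range
        samples.append(round(value * max_int))  # uniform quantization
    return samples

def tone(t):
    # Hypothetical input: a 440 Hz tone at half amplitude
    return 0.5 * math.sin(2 * math.pi * 440 * t)

pcm = digitize(tone, duration_s=0.01)
print(len(pcm))  # 80 samples in 10 ms at 8000 samples/sec
```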


Spectral Representation


Voice samples are converted to a spectral (frequency-domain) representation, which can be displayed graphically as a spectrogram
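A rough sketch of the spectral step, assuming NumPy is available: one short frame of samples is windowed and passed through a real-input FFT to get its magnitude spectrum. A spectrogram is a sequence of such spectra, one per frame. The Hann window and 256-sample frame length are illustrative choices, not details from the slides.

```python
import numpy as np

SAMPLE_RATE = 8000

def magnitude_spectrum(frame, sample_rate=SAMPLE_RATE):
    """Return (frequencies_hz, magnitudes) for one frame of samples."""
    windowed = frame * np.hanning(len(frame))  # taper edges to reduce leakage
    spectrum = np.fft.rfft(windowed)           # FFT for real-valued input
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, np.abs(spectrum)

# A 32 ms test frame (256 samples) containing a 1000 Hz tone
t = np.arange(256) / SAMPLE_RATE
frame = np.sin(2 * np.pi * 1000 * t)
freqs, mags = magnitude_spectrum(frame)
print(freqs[np.argmax(mags)])  # strongest component: 1000.0 Hz
```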


Segmentation


Speech signals are broken down into short segments.


Improves accuracy


Reduces computation: impractical to process the entire signal at once in real time
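Segmentation is often done by slicing the sample stream into short, overlapping frames. The sketch below assumes 25 ms frames with a 10 ms hop (200 and 80 samples at 8000 samples/sec); these are common choices in speech processing, not values given in the slides.

```python
def frame_signal(samples, frame_len=200, hop=80):
    """Split samples into overlapping fixed-length frames.

    At 8000 samples/sec, frame_len=200 is 25 ms and hop=80 is 10 ms.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

one_second = list(range(8000))  # placeholder for 1 s of audio samples
frames = frame_signal(one_second)
print(len(frames))  # 98
```

Each frame can then be analyzed independently, which is what keeps the per-step computation small.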


Graphical Representations

Acoustic Model


Phonemes


Smallest unit of sound in a language that distinguishes meaning


Distinguishes one word from another


E.g. “b” in “boy” and “t” in “toy”


Allophone


Different pronunciations of the same phoneme


E.g. “t” in “tab”, “t” in “stab”, “tt” in “stutter” (aspirated, unaspirated, and flapped, respectively)


Database (lexicon) of all words known to the system for a language


Should contain several pronunciations for certain words


E.g. “the” can be pronounced “duh” or “dee”
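A lexicon with pronunciation variants can be sketched as a dictionary from each word to a list of phoneme strings. The entries and ARPAbet-style symbols below are hypothetical illustrations built from the slide's examples (“the”, “boy”, “toy”).

```python
# Hypothetical toy lexicon: each word maps to one or more pronunciations,
# written as space-separated ARPAbet-style phoneme symbols.
LEXICON = {
    "the": ["DH AH", "DH IY"],   # "duh" and "dee"
    "boy": ["B OY"],
    "toy": ["T OY"],             # differs from "boy" by a single phoneme
}

def words_for(pronunciation):
    """Return every word in the lexicon that lists this pronunciation."""
    return sorted(word for word, prons in LEXICON.items() if pronunciation in prons)

print(words_for("DH IY"))  # ['the']
print(words_for("T OY"))   # ['toy']
```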

Acoustic Model Cont’d


Trellis


Data structure made up of all possible combinations of allophones
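The slides do not say how the best path through the trellis is found; a standard technique for this is the Viterbi algorithm over a hidden Markov model, sketched below. The three states (“s”, unaspirated “t”, and “ae”, as in “stab”), the acoustic labels, and every probability are hypothetical toy values chosen only to illustrate the search.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state path through an HMM trellis (log-probabilities)."""
    # trellis[i][s] = (best log-prob of a path ending in s at step i, previous state)
    trellis = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
                for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            best_prev, best_score = max(
                ((p, trellis[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (best_score + math.log(emit_p[s][obs]), best_prev)
        trellis.append(column)
    # Backtrack from the best-scoring final state
    last = max(states, key=lambda s: trellis[-1][s][0])
    path = [last]
    for column in reversed(trellis[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

# Hypothetical toy model: allophone states and coarse acoustic labels per frame
states = ["s", "t", "ae"]
start_p = {"s": 0.8, "t": 0.1, "ae": 0.1}
trans_p = {
    "s": {"s": 0.5, "t": 0.4, "ae": 0.1},
    "t": {"s": 0.1, "t": 0.5, "ae": 0.4},
    "ae": {"s": 0.1, "t": 0.1, "ae": 0.8},
}
emit_p = {
    "s": {"hiss": 0.8, "burst": 0.1, "vowel": 0.1},
    "t": {"hiss": 0.1, "burst": 0.8, "vowel": 0.1},
    "ae": {"hiss": 0.1, "burst": 0.1, "vowel": 0.8},
}
path = viterbi(["hiss", "burst", "vowel", "vowel"], states, start_p, trans_p, emit_p)
print(path)  # ['s', 't', 'ae', 'ae']
```

Working in log-probabilities avoids numerical underflow when the trellis grows long; all probabilities here are nonzero so `math.log` is always defined.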


Training of acoustic models


For single-user systems


Text is read aloud by the user and recognized by the system


For multi-user systems


Utterances spoken by many users are compiled into a database, then fed into a recognizer


Weights (probabilities) are assigned to certain allophones
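One simple way to derive such weights from a multi-user database is relative-frequency estimation: count how often each allophone was observed for each phoneme and normalize. The labeled tokens below are hypothetical, and real systems use considerably richer training procedures; this only illustrates the counting idea.

```python
from collections import Counter, defaultdict

def allophone_weights(labeled_tokens):
    """Estimate P(allophone | phoneme) by relative frequency.

    `labeled_tokens` is a hypothetical list of (phoneme, allophone) pairs
    pooled from many speakers' transcribed utterances.
    """
    counts = defaultdict(Counter)
    for phoneme, allophone in labeled_tokens:
        counts[phoneme][allophone] += 1
    weights = {}
    for phoneme, allophone_counts in counts.items():
        total = sum(allophone_counts.values())
        weights[phoneme] = {a: c / total for a, c in allophone_counts.items()}
    return weights

tokens = [("t", "aspirated"), ("t", "aspirated"), ("t", "unaspirated"), ("t", "flap")]
print(allophone_weights(tokens)["t"])
# {'aspirated': 0.5, 'unaspirated': 0.25, 'flap': 0.25}
```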

Language Model


Languages have structures (e.g. grammar)


Two similar-sounding words can be difficult to tell apart acoustically


Can be distinguished using context


E.g. “hours” is much more likely than “ours” if the previous word is “two”
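The context idea can be sketched with a bigram language model: count word pairs in a training corpus and pick the acoustically ambiguous candidate that is most probable after the previous word. The three-sentence corpus is hypothetical, and add-one smoothing is one simple choice among many.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count word pairs and single words in a (hypothetical) training corpus."""
    bigrams, unigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return bigrams, unigrams

def bigram_prob(bigrams, unigrams, prev, word, vocab_size):
    # Add-one (Laplace) smoothing so unseen pairs get a small nonzero probability
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def pick_word(candidates, prev, bigrams, unigrams):
    """Choose the candidate word that best fits the preceding word."""
    vocab_size = len(unigrams)
    return max(candidates, key=lambda w: bigram_prob(bigrams, unigrams, prev, w, vocab_size))

corpus = [
    "the meeting lasted two hours",
    "we waited two hours for the bus",
    "the house is ours",
]
bigrams, unigrams = train_bigrams(corpus)
print(pick_word(["ours", "hours"], "two", bigrams, unigrams))  # hours
print(pick_word(["ours", "hours"], "is", bigrams, unigrams))   # ours
```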

Common Applications


Call Center Automation



Widely used across industries as a consumer interface


Airline companies: booking flights, general info, etc.


Banking companies: “pay by phone”, account balances, etc.


Delivery services (FedEx): tracking orders, etc.


All general customer service systems


Computer integration of voice recognition


Personal Computers


Speech-to-text dictation


Accessibility purposes: voice control of computers


Common Applications cont’d


Integrated into automobiles:


Visteon Voice Technology™ used in the Infiniti Q45


Controls:


Climate


CD player


Navigation system

Competing Standards


VoiceXML (Voice Extensible Markup Language)


Partners: AT&T, IBM, Motorola, Lucent Technologies


Used in the implementation of most voice portals


Shifting its target toward web developers


SALT (Speech Application Language Tags)


Partners: Microsoft, Intel, Cisco, SpeechWorks


Targeted toward web developers

Current Leaders


Dragon Systems:


NaturallySpeaking: PC-based user-side programs for automated speech recognition (ASR)


Automotive, Telephony, Mobile, Games, Embedded Chips


SpeechWorks: Connects users to industry voice portals


AOLbyPhone, FedEx, E*Trade, etc.


BeVocal: provides voice portals for BellSouth, etc.


Tellme: provides voice portals for AT&T, Merrill Lynch, etc.


Philips Speech Recognition


Serves the automotive, mobile device, and consumer electronics industries


IBM ViaVoice, Microsoft Agent (MS Agent)

Demonstrations


SpeechWorks™ product line


United Airlines' toll-free flight information line (demo)


BankWorks Automated Bill Payment (demo)


FedEx Rate Finder (demo)


E*Trade Stock (demo)


AOLbyPhone service (demo)


BeVocal solutions

Magical Merlin’s Grade Retrieval System


Designed in Visual Basic using Microsoft’s MSAgent


Menu

Menu item: recognized voice commands

First Exam: First Exam, First Test, First Midterm

Second Exam: Second Exam, Second Test, Second Midterm

Quiz Grades: Quiz Grades, Grade on Quizzes

Homework Grades: Homework Grades, Grade on Homework

Project Grade: Project Grade, Grade on Project

Final Grade: Final Grade, Grade for course

Main Menu: Main menu, Main, Class
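The command table can be modeled as a lookup from each accepted phrase to its menu item. The original system was written in Visual Basic with MSAgent; this Python version is a hypothetical re-creation of only the phrase-routing logic, assuming the recognizer has already produced a text string.

```python
# Hypothetical re-creation of the grade system's command table.
COMMANDS = {
    "First Exam": ["first exam", "first test", "first midterm"],
    "Second Exam": ["second exam", "second test", "second midterm"],
    "Quiz Grades": ["quiz grades", "grade on quizzes"],
    "Homework Grades": ["homework grades", "grade on homework"],
    "Project Grade": ["project grade", "grade on project"],
    "Final Grade": ["final grade", "grade for course"],
    "Main Menu": ["main menu", "main", "class"],
}

# Invert the table once so each spoken phrase looks up its menu item directly.
PHRASE_TO_MENU = {
    phrase: menu for menu, phrases in COMMANDS.items() for phrase in phrases
}

def route(recognized_text):
    """Map the recognizer's text output to a menu item, or None if unknown."""
    return PHRASE_TO_MENU.get(recognized_text.strip().lower())

print(route("First Midterm"))     # First Exam
print(route("grade for course"))  # Final Grade
print(route("weather"))           # None
```

Accepting several synonyms per menu item, as the table does, lets users phrase requests naturally while keeping the recognizer's vocabulary small.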


Future Challenges


Speech Technology


VoiceXML vs. SALT


Voice-enabling web content


Real-time access to source data


Stock market, traffic, sports, etc.


A clear connection is needed for effective use of voice portals


Security issues


Advertising-based revenue

