An IJIS Institute Briefing Paper
EMERGING TECHNOLOGY WHITE PAPER

SPEECH RECOGNITION & INTERACTIVE VOICE RESPONSE TECHNOLOGY




Emerging Technologies Committee
ACKNOWLEDGEMENTS

The IJIS Institute would like to thank the following individuals and their sponsoring
companies for their dedication and input on this document:

Matthew A. D’Alessandro, Motorola – Committee Chair
John Crouse, ACS Government Solutions – Committee Co-Chair
Martin Pastula, Microsoft Corporation















This project was supported by Grant No. 2003-LD-BX-0007 awarded by the Bureau of Justice
Assistance. The Bureau of Justice Assistance is a component of the Office of Justice Programs,
which also includes the Bureau of Justice Statistics, the National Institute of Justice, the Office of
Juvenile Justice and Delinquency Prevention, and the Office for Victims of Crime. Points of view or
opinions in this document are those of the author and do not represent the official position or policies
of the United States Department of Justice.


INTRODUCTION

Over the past decade, the computer industry has made significant progress with speech
recognition. Speech recognition and text-to-speech synthesis technologies continue to be
adopted successfully by government agencies and Fortune 500 companies. These
organizations have typically deployed large enterprise-grade proprietary platforms into
their call centers and realized significant business benefits despite the high costs of
deploying such technology. Still, the costs and complexities of developing speech platforms
and applications have typically been out of reach for smaller companies and government
agencies. Today this is changing as the speech recognition industry is evolving; system
quality has improved while costs and deployment times have been reduced drastically.
This paper addresses the foundations of speech recognition, the speech recognition systems available today, their costs, and potential applications for Justice, Public Safety, and Homeland Security organizations.





What is Speech Recognition?
Speech recognition is the automated process of recognizing spoken words and converting that speech to text for use by a computer application. It consists of digitizing an audio stream and then parsing the digitized data into meaningful segments. The segments are mapped against a database of known phonemes (the basic units of sound that make up a language), and the resulting phonetic sequences are matched against a known vocabulary, or dictionary, of words. Speech recognition is one of the most complex challenges facing computer engineers today and is often touted as the “holy grail” of computing. In recent years, increases in processing power and memory capacity, together with improved speech algorithms, have made speech recognition much more commercially viable for law enforcement applications. In addition, this technology has been implemented successfully in a variety of other justice areas, including the courts.
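The phoneme-to-word mapping described above can be sketched as a toy dictionary lookup. A real engine scores candidates statistically (for example, with hidden Markov models) rather than exact-matching, and the phoneme labels and vocabulary below are invented for illustration:

```python
# Toy sketch of mapping recognized phoneme sequences to dictionary words.
# The ARPAbet-style phoneme labels and the vocabulary are hypothetical;
# a real engine scores candidates statistically instead of exact-matching.
PHONETIC_DICTIONARY = {
    ("HH", "EH", "L", "OW"): "hello",
    ("N", "EY", "M"): "name",
    ("G", "R", "EH", "G"): "greg",
}

def map_phonemes_to_word(phoneme_sequence):
    """Return the vocabulary word for a phoneme sequence, if one is known."""
    return PHONETIC_DICTIONARY.get(tuple(phoneme_sequence), "<unknown>")

print(map_phonemes_to_word(["HH", "EH", "L", "OW"]))  # hello
print(map_phonemes_to_word(["Z", "Z"]))               # <unknown>
```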

Two Main Types of Systems
Speech recognition falls into two main
types: speaker-dependent continuous-speech
PC-based systems, and speaker-independent
continuous-speech server-based systems.
Speaker-dependent systems require operators to “train” the system by speaking the words that will be used, which entails an extra set-up operation at the end-user site; every individual operator of the speech system must train the system to recognize his or her own voice, speaking the words that are to be recognized in the final application. This can take from ten minutes to several hours, but needs to be done only once per operator.
Speaker-independent means that, in theory, any individual can speak commands to the computer without having to “train” the system for his or her voice. The cornerstone of this technology is that rather than having one person train the system, hundreds or thousands of people “train” it as part of the recognition engine development cycle. The speech product is delivered to the end user in working form; the computer performs statistical matches between what any speaker says and the “canned” library of speech patterns.
Historically, speech recognition was also divided into non-continuous (discrete) and continuous speech systems. In non-continuous speech systems, users must pause between spoken words (Hello–pause–my–pause–name–pause–is–pause–Greg). In continuous speech systems, users can say “Hello, my name is Greg” without pausing.
This paper addresses the implications of
server-based systems as applied to
telephony, which is one of the biggest
markets for speech recognition systems.
In a stand-alone or PC-based configuration, the speaker is connected directly to the recognition engine and uses the same microphone all the time; in a telephony configuration, the public switched telephone network sits in the middle and a wide variety of input devices (cell phones, headsets, POTS phones) is in use.




Examples of telephony-based applications
that are commercially deployed today
include:
• Home banking applications enabling
customers to query their account
balances, the status of cleared checks,
etc.
• Shopping programs allowing customers to specify products, enter credit card numbers, addresses, and other pertinent information. Shopping becomes a 24-hour-a-day activity, requiring only a working phone line.
• Companies can fax back product
literature in response to voice requests.
Customers can quickly specify the exact
literature they need without wading
through layers of menu selections. The
same approach works for technical
support.

• Queries can also be handled in
response to voice prompts.
Transportation companies can provide
schedule information (What day are
you traveling on? etc.) without
requiring users to enter an endless
chain of keypad responses. This same
approach is used by entertainment
companies.
• Advanced messaging systems forward
voice calls, email, and faxes after
receiving voice commands.
• In call routing, the caller may be asked to say what department or individual he or she is calling, and the call is then automatically sent to the right extension.
• Cellular phone services provide voice-
activated calling after a user dials into
the central site, most likely via a speed
dial button on the cellular phone. “Call
Andrew Davis” is then all that’s
needed for call completion. This
enables users such as vehicle operators
to keep their eyes on the road.
• Telephone companies can provide
directory assistance for businesses as
well as for residences. Users speak the
name of the city and the name of the
business; the computer responds with
the telephone number.
• Simple queries that result in simple
decisions can be completely automated.
“Will you accept the collect call from
XYZ” can be totally automated,
reducing the costs of providing phone
services.
• Brokerage and stock exchanges are
looking at voice-activated systems to
track orders and enter them into the
system.
For a broader array of organizations to fully realize the benefits of adopting and deploying speech technologies, companies need lower-cost speech platforms that leverage and integrate easily with their existing industry-standard telephony, IT systems, and applications. Low-cost speech engines and open, standards-based languages built on web technologies, such as Voice eXtensible Markup Language (VoiceXML) and Speech Application Language Tags (SALT), are now helping companies and government agencies of all sizes accelerate the use of speech-enabled interactive voice response and web applications on a single platform.
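As a flavor of what these open standards look like, the following is a minimal VoiceXML dialog for a call-routing prompt. The element names follow the W3C VoiceXML 2.0 specification; the prompt text and the grammar file name are invented for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="routing">
    <field name="department">
      <prompt>Which department are you calling?</prompt>
      <!-- Hypothetical SRGS grammar listing the recognizable departments -->
      <grammar type="application/srgs+xml" src="departments.grxml"/>
      <filled>
        <prompt>Transferring you to <value expr="department"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Because such a dialog is ordinary markup served over HTTP, agencies can author and host it with the same web infrastructure and staff they already use for web applications.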

DOMAIN APPLICATION OF TECHNOLOGY

Speech Recognition has been used
primarily to assist in the exchange of
information, improve performance of
common business processes, increase
administration efficiencies, and to assist
citizens interacting with a variety of
justice agencies at the Federal, State, and
local levels across the US. Specifically,
agencies are using it to leverage the
information contained in integrated
criminal justice systems.




The primary benefits of Speech
Recognition technology are:
• Improvement of Information Access—
Speech applications are aimed at
eliminating agents’ routine tasks and
providing users with greater access to
information.
• Communications—Speech applications
make access to individuals, groups, or
electronic information easier, especially
when a computer may not be available.
Cell phones are one means to achieve
this capability.
• Customer/Citizen care—Increasing call
center automation and transaction
completion rates ultimately improve
operational efficiencies and reduce
operating costs.
• Transactions—Speech applications help
users conduct business and personal
transactions in a secure and
personalized environment.
• Productivity—Speech applications
streamline the efficiency of internal
government processes, potentially
reducing the costs of operations.


HARDWARE OR SOFTWARE REQUIREMENTS

The speech recognition industry can now be considered modular, meaning that a solution is made up of distinct components. The emergence of open standards and higher-accuracy, lower-cost speech software has made this possible. The industry can be broken down into the following components, as depicted in Figure 1: engines, platforms, and applications.
Engine providers supply the software that performs the recognition operation. In simple terms, the engine is the mechanism that takes an audio stream, applies grammar rules, and produces a list of words representing its best estimate of what was said.
On the right side of Figure 1 are the application providers or system integrators. They build on the engine to provide specific applications, which can be tailored to an organization’s purpose or can be generic, such as an auto attendant that prompts users to respond to certain questions by selecting certain responses throughout the voice recognition information-acquisition phase. Application developers build the relevant dialogs and perform any required usability testing; they also provide integration with existing legacy or back-end systems.
The third component of server-based speech recognition is the platform, which manages the interactions between telephony services, engines, and applications. Until recently, platforms (Interactive Voice Response systems) were based on proprietary software. However, the emergence of open standards such as SALT and VoiceXML has given rise to a modular industry and a resulting architecture that allows engines and applications to be completely independent.



Figure 1: Components of Speech Recognition “Value Chain”
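The independence of these three components can be sketched in code. The class names and routing strings below are hypothetical; the point is that any standards-compliant engine or application can be swapped in without changing the platform:

```python
# Sketch of the modular "value chain": a platform mediating between an
# interchangeable engine and an interchangeable application (names hypothetical).
class RecognitionEngine:
    """Engine component: turns an audio stream into recognized text."""
    def recognize(self, audio_stream):
        raise NotImplementedError

class MockEngine(RecognitionEngine):
    def recognize(self, audio_stream):
        # A real engine would apply acoustic models and grammar rules here.
        return "warrants division"

class AutoAttendantApp:
    """Application component: maps recognized text to a business action."""
    ROUTES = {"warrants division": "ext-4100", "records": "ext-4200"}
    def handle(self, utterance):
        return self.ROUTES.get(utterance, "operator")

class Platform:
    """Platform component: manages the interaction between engine and app."""
    def __init__(self, engine, app):
        self.engine, self.app = engine, app
    def handle_call(self, audio_stream):
        return self.app.handle(self.engine.recognize(audio_stream))

platform = Platform(MockEngine(), AutoAttendantApp())
print(platform.handle_call(b"<caller audio>"))  # ext-4100
```

Either side can be replaced (a different engine, a different application) as long as both speak the platform's interface, which is what the open standards provide.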


ENTITIES RESEARCHING OR DEVELOPING THIS TECHNOLOGY

A variety of organizations are involved in the research and development of speech recognition engines, speech recognition platforms, and the associated standards or models under which they operate, including SALT and VoiceXML.
Vendors involved in this area include:
Engine Vendors

Vendor                                   Contact             Product
BBN Technologies                         www.bbn.com         BBN Hark HLT
IBM Pervasive Computing                  www.ibm.com         WebSphere Voice Server (Speech Recognition)
Loquendo                                 www.loquendo.com    Loquendo ASR
Microsoft Corp                           www.microsoft.com   Microsoft Speech Server 2004
Nuance (merged with ScanSoft)            www.nuance.com      Nuance 8.5 (ScanSoft OpenSpeech Recognizer, SpeechPearl)
Phonetic Systems (merged with ScanSoft)  www.nuance.com      Voice Search Engine
Telisma                                  www.telisma.com     Telisma ASR 3.2

Platform & Engine Vendors

Vendor       Contact             Product
Microsoft    www.microsoft.com   Microsoft Speech Server
Nuance       www.nuance.com      Nuance Voice Platform (NVP)

Application Vendors

Vendor          Contact
Computer Talk   www.computer-talk.com
Gold Systems    www.goldsystems.com
Nuance          www.nuance.com
Infosys         www.infosys.com
Intervoice      www.brite.com
WiPro           www.wipro.com



AVAILABILITY, COST, AND FORECAST PRESENCE

The costs of deploying a speech recognition telephony system can vary widely depending on the size of the system and its included functionality. Generally, technology prices range from $500 to $1,500 per port, and the average agency can expect to pay between $50K and $150K for an “average” application. Deployment and maintenance costs are typically extra and are usually calculated as a fixed percentage of the total solution.
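A back-of-the-envelope calculation shows how these figures combine. The 24-port system size and the 18 percent maintenance rate are assumptions made for the example; only the per-port price range comes from the text above:

```python
# Rough deployment-cost estimate using the per-port range cited above.
# Port count and maintenance percentage are assumed for illustration.
ports = 24
cost_per_port_low, cost_per_port_high = 500, 1500   # dollars per port
maintenance_rate = 0.18  # fixed percentage of the total solution (assumed)

low, high = ports * cost_per_port_low, ports * cost_per_port_high
print(f"Port licensing: ${low:,} to ${high:,}")
print(f"Annual maintenance: ${low * maintenance_rate:,.0f} to ${high * maintenance_rate:,.0f}")
```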
Speech recognition vendors are presenting their telephony products as a complement to, rather than competition for, web sites, as well as a natural extension of mobile enterprise networks. Speech is also being used as an accessibility tool for users with disabilities. Current industry research indicates that the phone is still the most popular medium for communication and information requests.
STRENGTHS, WEAKNESSES, & RISKS

Strengths
• Lower cost of ownership by using lower-cost speech technology and eliminating proprietary technology.
• Simplified speech integration, making speech technology available to the general market and application developers.
• Ability to leverage staff web developers to implement speech technology.
• Centralized implementation of standards and technologies like XML or Web Services can provide a lower total cost of ownership.
Weaknesses
• Speech recognition accuracy may be affected by background and ambient noise. This may limit its applicability to field-related public safety applications.
• Speech recognition, if not implemented with a well-designed Voice User Interface (VUI), may not function properly and leave users with a negative experience.
Risks
• Vendor financial stability should be taken into consideration when evaluating speech vendors.
• Ever-evolving emerging standards: agencies should align with standards that have strong industry support, like SALT or VoiceXML.


ASSOCIATED STANDARDS

Open Standards that should be considered include:
• SALT (SALTforum.org).
• VoiceXML (VoiceXMLforum.org, W3C).
• CCXML (W3C).
• Speech Recognition Grammar Specification (GrXML).
CURRENT MAINSTREAM OR ALPHA/BETA USERS OF THIS TECHNOLOGY

The technologies referenced above are all available as of the writing of this document. Their use has not yet become widespread, owing to previous technical limitations, the costs of the technology, and the complexity of successful implementation.

State of Alabama
The State of Alabama Integrated Criminal Justice System (SAICS) deployed a speech-enabled application, built on Microsoft Speech Server with Microsoft partner Computer Talk, assisting two groups.
• Police Officers and Dispatchers—Suspect
identity verification data was not
accessible by SAICS officers when they
were away from regular computers. To
help 6,100 officers in the field and
reduce the burden on dispatchers for
suspect identity verification data,
officers can access driver’s license,
social security, and license plate data
over the phone with a direct voice
query. The application transmits data
verbally and visually for devices that
support multimodal applications.
• Child Support Court—Alabama has also deployed a speech-enabled application for court-enforced child support payments. The application allows users to call the state and say their alphanumeric case number to determine the status of their child support check.
State of New Jersey
The State of New Jersey has implemented
several self-service applications utilizing
IBM WebSphere voice response capabilities
including:
• State Court Systems—This system is
used by attorneys to sign up for court
dates for pending trials. The system
automates the court date registration
process. Callers can also access the
system to find out when cases are
scheduled.
• Welfare Reporting—Welfare recipients
call into this system to verify their
employment (or unemployment) status.
Once the status is verified, Welfare
coverage is determined, and feedback
is provided to the caller.


LINKS TO MORE INFORMATION

• US Department of Justice, Office of Justice Programs, Information Technology Initiatives: http://www.it.ojp.gov/
• Project 54: www.project54.unh.edu
• Boston Globe article: http://www.boston.com/ae/theater_arts/articles/2005/05/22/for_police_word_from_the_wise_is_sufficient?mode=PF
• Washington Technology article: http://www.washingtontechnology.com/news/20_22/emergingtech/27310-1.html