A NEW WORD-RECOGNITION SIMULATION TECHNIQUE TO ENHANCE THE ROBUSTNESS AGAINST RECOGNITION ERRORS OF SPOKEN DIALOGUE SYSTEMS


R. López-Cózar, A. J. Rubio, P. García, J. C. Segura

Dpto. Electrónica y Tecnología de Computadores
Universidad de Granada, 18071 Granada, España (Spain)
Tel.: +34-958-243271, FAX: +34-958-243230
E-mail: {ramon,rubio,pedro,segura}@hal.ugr.es



ABSTRACT


We have developed a spoken dialogue system for Spanish, named SAPLEN, which aims to handle telephone-based product orders and queries from clients of fast-food restaurants. The system uses a continuous-speech recognition module developed at our laboratory, which uses context-independent phoneme-like units modelled by SCHMM (Semi-Continuous Hidden Markov Models). The vocabulary contains about 2,000 words, including restaurant-product names and names of streets, avenues, squares, etc. The language is modelled by bigrams. In our preliminary experiments we used a speech recognition simulator which can insert, change or remove words in the sentences uttered by the users, depending on several parameters that determine its performance. In this paper we present the performance of the simulator and describe how it assigns confidence values to the words. Finally, we outline how the demonstration at Eurospeech'99 can be carried out, using the simulator as the interface between the dialogue system and the users.


1. INTRODUCTION


Applications of speech technology include all kinds of systems in which part of the communication process is carried out by voice. Real-world applications involve a human being trying to communicate with a machine to obtain some information or service. Relatively simple speech recognition systems can achieve some goals when the dialogue is heavily restricted, as is usually the case with isolated-word speech recognition systems [1]. These dialogue restrictions require training and collaboration on the part of the users. Unrestricted dialogue applications are much more appealing, as users do not need to be trained and collaboration requirements are minimal. These systems include a dialogue module that is an important part of the whole system [2]. Spoken dialogue systems generally use confidence values assigned to the recognised words. Every word has an associated confidence value that represents the recogniser's confidence in its correct recognition. If a confidence value is below a threshold, the word is considered a recognition error and the system must ask the user to re-enter it. Alternatively, the system may request a confirmation for the word [3].


2. THE SPEECH RECOGNITION SIMULATOR


In our preliminary experiments we used a speech recognition simulator which can insert, change or remove words in the sentences uttered by the users, depending on several parameters that determine its performance [4]. A noise level parameter ($n_l$) represents the negative effect of extraneous noise on the user's voice signal. A distortion probability parameter ($d_p$) determines how many words in the sentences uttered by the users are converted into recognition errors. Three additional parameters, insertion probability ($i_p$), remove probability ($r_p$) and change probability ($c_p$), determine how many words are inserted, removed or changed, respectively, because of distortions. The simulator thus takes as inputs the sentences uttered by the users and provides as outputs distorted versions of them. Additionally, it assigns a confidence value $conf(w_i)$, $0 \le conf(w_i) \le 1$, $conf(w_i) \in \mathbb{R}$, to every word $w_i$ in the distorted sentences. Two parameters, recognition weight ($w_r$) and language weight ($w_l$), are used for this purpose. The simulation technique considers that the confidence value of the distorted words must be low. Therefore, a reduction factor parameter ($r_f$) is used to reduce the confidence values assigned to such words. The simulator takes into account the number of words in the lexical classes contained in the dictionary of the system. It also considers the expectations about what the user will probably say in his/her next interaction [5].
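The distortion step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the simulator itself: the function name `distort`, the `vocabulary` list used to draw spurious words, and the way the parameters interact (each word is distorted with probability $d_p$, and the kind of distortion is then drawn with relative weights $i_p$, $r_p$, $c_p$) are all hypothetical. The noise level $n_l$ is omitted because the text does not say how it enters the process.

```python
import random


def distort(words, vocabulary, d_p, i_p, r_p, c_p, rng=None):
    """Return a list of (word, is_distorted) pairs for one sentence.

    Assumption: a word is distorted with probability d_p; the kind of
    distortion is then chosen with relative weights i_p, r_p and c_p.
    """
    rng = rng or random.Random()
    output = []
    for word in words:
        if rng.random() >= d_p:
            output.append((word, False))  # word passes through unchanged
            continue
        kind = rng.choices(["insert", "remove", "change"],
                           weights=[i_p, r_p, c_p])[0]
        if kind == "insert":
            # Keep the original word and add a spurious one after it.
            output.append((word, False))
            output.append((rng.choice(vocabulary), True))
        elif kind == "change":
            # Replace the original word with another vocabulary word.
            output.append((rng.choice(vocabulary), True))
        # "remove": the word is simply dropped from the output.
    return output


if __name__ == "__main__":
    sentence = "I want two burgers".split()
    vocab = ["pizza", "cola", "three", "salad"]
    print(distort(sentence, vocab, d_p=0.3, i_p=1, r_p=1, c_p=2,
                  rng=random.Random(1)))
```

Passing a seeded `random.Random` makes a run repeatable, which is convenient when the same distorted sentences have to be replayed against the dialogue system.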



Fig. 1. Speech recognition simulator


The simulation technique considers a recognition confidence measure $conf_{recon}(w_i)$ and a language confidence measure $conf_{lang}(w_i)$ to compute $conf(w_i)$. The first measure takes into consideration the fact that some words are generally easier to recognise than others. The second measure considers that the words expected in a given dialogue situation can be easier to recognise than unexpected ones. The simulation technique uses a noise function $noise(w_i)$, $0 \le noise(w_i) \le 1$, $noise(w_i) \in \mathbb{R}$, to estimate $conf_{recon}(w_i)$. This function tries to model the noise present when recognising $w_i$. To estimate $conf_{recon}(w_i)$, the simulation technique considers whether $w_i$ is a distorted word, i.e. a word that has been inserted, changed or removed by the recognition simulator. If $w_i$ is not a distorted word then $conf_{recon}(w_i)$ is estimated as follows:


$$conf_{recon}(w_i) = 1 - noise(w_i)$$


If $w_i$ is a distorted word then its confidence value is affected by the reduction factor parameter ($r_f$), and $conf_{recon}(w_i)$ is estimated as follows:


$$conf_{recon}(w_i) = \frac{1 - noise(w_i)}{r_f}$$
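As a rough illustration, the two cases of the recognition confidence can be written as one small function. How $r_f$ enters the formula is an assumption in this sketch (division, with $r_f \ge 1$ so that the value can only decrease); the function name `conf_recon` and the argument checks are likewise hypothetical.

```python
def conf_recon(noise, distorted, r_f):
    """Recognition confidence of a single word.

    noise     -- value of noise(w_i), expected in [0, 1]
    distorted -- True if the simulator distorted the word
    r_f       -- reduction factor; assumed here to divide the confidence
    """
    if not 0.0 <= noise <= 1.0:
        raise ValueError("noise must lie in [0, 1]")
    if r_f < 1.0:
        raise ValueError("this sketch assumes r_f >= 1")
    base = 1.0 - noise  # confidence of an undistorted word
    return base / r_f if distorted else base


if __name__ == "__main__":
    print(conf_recon(0.2, distorted=False, r_f=2.0))
    print(conf_recon(0.2, distorted=True, r_f=2.0))
```

With these assumptions a distorted word never receives a higher recognition confidence than an undistorted word with the same noise value, which matches the stated intent that distorted words get low confidence.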


The technique also considers the expectations about what the user will probably say in his/her next interaction to estimate the language confidence $conf_{lang}(w_i)$. This value is estimated as follows:


$$conf_{lang}(w_i) = \begin{cases} \dfrac{1}{N(EXPECTED(t))} & \text{if } w_i \in EXPECTED(t) \\ 0 & \text{if } w_i \notin EXPECTED(t) \end{cases}$$


In this expression, $EXPECTED(t)$ is a function that returns the set of all the words expected at time $t$ in a conversation, and $N(\cdot)$ is the cardinality function. Finally, $conf(w_i)$ is calculated as the weighted sum of the confidence values estimated above:


$$conf(w_i) = w_r \cdot conf_{recon}(w_i) + w_l \cdot conf_{lang}(w_i)$$
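A minimal sketch of the language confidence and of the weighted sum is given below. It assumes that the language confidence of an expected word is $1/N(EXPECTED(t))$ and that the expected words are held in a Python set; the function names, the example words and the weights 0.7 and 0.3 are purely illustrative.

```python
def conf_lang(word, expected):
    """Language confidence: 1/N(EXPECTED(t)) for expected words, else 0."""
    return 1.0 / len(expected) if word in expected else 0.0


def conf(word, conf_recon_value, expected, w_r, w_l):
    """Weighted sum of the recognition and language confidences."""
    return w_r * conf_recon_value + w_l * conf_lang(word, expected)


if __name__ == "__main__":
    expected_now = {"yes", "no"}  # hypothetical EXPECTED(t)
    # Expected word: both terms contribute to the final confidence.
    print(conf("yes", 0.8, expected_now, w_r=0.7, w_l=0.3))
    # Unexpected word: only the recognition term contributes.
    print(conf("pizza", 0.8, expected_now, w_r=0.7, w_l=0.3))
```

Under this form, the fewer words the dialogue state expects, the larger the language confidence of each expected word.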








If a user utters an unexpected word $w_i$ at time $t$, i.e. a word that is not contained in $EXPECTED(t)$, then $conf_{lang}(w_i) = 0$. In this case $conf(w_i)$ depends only on the recognition confidence. After a confidence value has been assigned to every word in the distorted sentences, the dialogue system uses a confidence threshold $C_T$ to decide which words are considered to have been correctly recognised. This is the case when $conf(w_i) \ge C_T$.
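The threshold test can be sketched as a simple partition of the scored words. The function name `split_by_threshold` and the example confidence values are hypothetical; what the dialogue system does with the rejected words (asking the user to re-enter or to confirm them) is not modelled here.

```python
def split_by_threshold(scored_words, c_t):
    """Partition (word, confidence) pairs using the threshold C_T.

    Words with confidence >= c_t are treated as correctly recognised;
    the rest are treated as recognition errors.
    """
    accepted = [word for word, value in scored_words if value >= c_t]
    rejected = [word for word, value in scored_words if value < c_t]
    return accepted, rejected


if __name__ == "__main__":
    scored = [("two", 0.9), ("burgers", 0.4), ("please", 0.6)]
    accepted, rejected = split_by_threshold(scored, c_t=0.6)
    print("accepted:", accepted)
    print("rejected:", rejected)
```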


3. DEMONSTRATION


At the moment the interaction with users can be carried out via keyboard or, in our lab, via voice. We have not yet set up the telephone interface, so during the demonstration the users must communicate with the system using the keyboard. The speech recognition simulator can therefore be used as the interface between the dialogue system and the users, making the dialogue system work under simulated real-world conditions. During the conversations, the users can place orders for fast food and drinks. They can also obtain information about the products available at a restaurant (foods, drinks, prices, ingredients, etc.). Our aim is for these products to be delivered to the customer's home once the system is in commercial operation. When a user orders a product, the system asks for his/her telephone number. If the user is already known to the system, it confirms the address data previously stored in a database. Otherwise, the system asks for the city area and the address. After the system has confirmed all these data, it asks the user to accept the final price to pay as well as the estimated delivery time of the ordered products.
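The order-confirmation flow described above can be summarised as a small decision procedure. The sketch below is only an outline under assumptions: the function name `order_steps`, the step descriptions and the dictionary `address_db` (standing in for the database of known users) are hypothetical and do not reflect how SAPLEN is implemented.

```python
def order_steps(phone_number, address_db):
    """List the system's steps once the user has given a telephone number.

    address_db is a hypothetical mapping from telephone numbers to
    stored addresses; it stands in for the database of known users.
    """
    steps = []
    if phone_number in address_db:
        # Known user: confirm the address already on record.
        steps.append("confirm stored address: " + address_db[phone_number])
    else:
        # Unknown user: collect the delivery data, then confirm it.
        steps.append("ask for city area")
        steps.append("ask for address")
        steps.append("confirm city area and address")
    steps.append("ask user to accept final price and estimated delivery time")
    return steps


if __name__ == "__main__":
    db = {"958000000": "Example Street 1"}  # made-up record
    for step in order_steps("958000000", db):
        print(step)
```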


4. CONCLUSIONS AND FUTURE WORK


In this paper we have introduced a new technique to simulate the word recognition process carried out by spoken dialogue systems. The technique relies on several parameters that determine how many words in the sentences uttered by the users are changed. It assigns a confidence value to every word, which represents the recognition confidence of a real word recogniser. We believe this technique can be useful for developing robust spoken dialogue systems, as it allows their performance to be checked and evaluated under different (simulated) recognition situations simply by readjusting several parameters. As future work we plan to study whether it is possible to determine the values that must be assigned to the parameters in order to simulate the performance of a given word recogniser.


5. REFERENCES


[1] L. Rabiner, B. H. Juang, "Fundamentals of Speech Recognition", Prentice-Hall, 1993.

[2] Victor Zue, "Conversational interfaces: Advances and challenges", Keynote 2, Eurospeech'97.

[3] Thomas Kemp, Thomas Schaaf, "Estimating Confidence Using Word Lattices", Eurospeech'97, pp. 827-830.

[4] R. López-Cózar, A. J. Rubio, P. García, J. C. Segura, "A Spoken Dialogue System Based on a Dialogue Corpus Analysis", First International Conference on Language Resources and Evaluation (LREC'98), pp. 55-58.

[5] Morena Danieli, "On the use of expectations for detecting and repairing human-machine miscommunication", Working notes of the AAAI-96, pp. 87-93.