Cross-Modality Semantic Integration with Hypothesis Rescoring for Robust Interpretation of Multimodal User Interactions

Pui-Yu Hui and Helen Meng
Human-Computer Communications Laboratory
The Chinese University of Hong Kong

We develop a cross-modality semantic integration procedure pertaining to automatic semantic interpretation of multimodal user interactions using speech and pen gestures. This is achieved by the Viterbi alignment algorithm, which enforces the temporal ordering of the input events as well as the semantic compatibility of aligned events. The alignment enables generation of a unimodal, verbalized paraphrase that is semantically equivalent to the original multimodal expression. Our experiments are based on a multimodal corpus in the domain of city navigation. Application of the cross-modality integration procedure to near-perfect (manual) transcripts of the speech and pen modalities shows that correct unimodal paraphrases are generated for over 97% of the training and test sets. However, when we replace these with automatic speech and pen recognition transcripts, the performance drops to 53.7% and 54.8% for the training and test sets respectively. In order to address this issue, we devised a hypothesis rescoring procedure that evaluates all candidates of cross-modality integration derived from multiple recognition hypotheses from each modality. The rescoring function incorporates the integration score, the N-best purity of recognized spoken locative expressions, as well as the distances between the coordinates of recognized pen gestures and their interpreted icons on the map. Application of cross-modality hypothesis rescoring improved the performance to 67.5% and 69.9% for the training and test sets respectively.

I. INTRODUCTION

We develop a cross-modality semantic integration procedure for automatic semantic interpretation of multimodal user interactions using speech and pen gestures. Each modality in the multimodal user input presents a different abstraction of the user's informational or communicative goal as one or more input events. An input event may be a spoken deictic term/phrase or a pen action. The semantics of an input event may be imprecise (e.g. a pen stroke on a map may denote a street or a demarcation), incomplete (e.g. use of anaphora in "how about the previous one"?) or erroneous due to misrecognitions (e.g. speech or pen gesture recognition errors). These problems motivate us to continue our previous work in [1] to investigate how we may leverage the mutual reinforcements and mutual disambiguation across modalities [2] to achieve robustness towards misrecognitions and imperfectly captured inputs.


Previous approaches towards semantic interpretation of multimodal input include frame-based heuristic integration [3, 4], unification parsing [5, 6], the hybrid symbolic-statistical approach [7, 8], weighted finite-state transducers [9], probabilistic graph matching [10, 11] and the salience-driven approach [12]. We aim to devise a cross-modality (speech and pen) semantic integration framework that draws from previous experiences but is extended with several desirable features.


We designed a robust interpretation framework with an integrative cost function that incorporates a weighted combination of ranked confidence scores from speech recognition and pen recognition, together with a cross-modal compatibility score. The semantic compatibility can be derived by a simple process of exploratory data analysis of a training set; hence our approach does not involve grammar writing. The requirement on training data is relatively small, because the training data set is used only for human reference and for tuning the cost functions. This alleviates the problem of over-training due to insufficient training data. The cross-modality semantic integration component can be directly incorporated as a front-end to our existing spoken dialog system (SDS) [13, 14, 15]. More specifically, cross-modality semantic integration generates a unimodal (verbalized) paraphrase of the multimodal input for further processing by the SDS, thereby leveraging existing capabilities in natural language understanding, as well as discourse and dialog modeling. Additionally, the proposed framework is largely domain-independent: portability across information domains will require only a new domain-specific language model for the speech recognizer, as well as a set of domain-specific feature lists, which will be elaborated later.


The following presents our work in the design and collection of a multimodal corpus, characterization of speech and pen gestures for unimodal interpretation, cross-modality semantic integration, hypothesis rescoring, as well as empirical performance evaluation.

II. DESIGN AND COLLECTION OF A MULTIMODAL CORPUS

Information in the multimodal corpus is focused on the navigation of Beijing, because navigation always involves the use of a map and inputs for spatial information. We invited 23 subjects from across the university to participate in data collection with a Pocket PC. Each subject is requested to formulate a task-oriented multimodal input according to the instructions provided. The designed input may involve from zero (e.g. a command for map rendering) to six spoken locative references (SLRs), the maximum number of locations involved in a task. The inputs may also involve zero to six pen gesture instances. The subjects are also informed on the use of SLRs and pen gesture instances. They are also allowed to revise and re-compose their own multimodal inputs so as to express the specified task constraints clearly. We collected a total of 1,518 manually transcribed inputs from 23 subjects. Among these inputs, 1,442 are multimodal (with speech and pen gestures) and the remaining 76 are unimodal (speech only). There are 3,590 instances of pen gestures in the corpus in total. We divided the 1,442 multimodal inputs randomly into two separate data sets: training (999) and testing (443). The training set is used to train the weights of the rescoring framework.

III. CROSS-MODALITY INTEGRATION

The multimodal input in this work is composed of two modalities, speech and pen gestures. Each of the modalities abstracts the user's message differently into a sequence of (one or more) input events. The input event in each of the modalities is associated with imprecise, incomplete or incorrect semantic information that is ambiguous. This may be due to the use of an inappropriate spoken deictic, noisy pen gesture inputs or recognition errors.


An SLR of a spoken input is a referential expression for spatial information on the map. It can be a direct reference by full name or abbreviation (e.g. 中國地質大學, 地大) or an indirect reference by a deictic. A deictic may come with specified location types and numeric features (e.g. 這裡 "here"; 這四所大學 "these four universities"; 這些地方 "these places"). We create a list of hypotheses of possible semantic interpretations (i.e. locations on the map) for each SLR. Direct references will only have a single entry (the full name of a location, or 身處點 for the current location) as the hypothesis. Indirect references will filter through all icons shown on the map for locations with a matched location type, if one is specified. Otherwise, all the locations on the map will be included as hypotheses. The numeric feature is stored as the attribute NUM with the hypothesis list too.
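
The following is a minimal Python sketch of this hypothesis-list generation step; the data structures, field names and the map_icons list are illustrative assumptions rather than the actual implementation.

# Illustrative sketch only: hypothesis-list generation for one SLR.
# The 'slr' and 'map_icons' field names are assumptions, not the actual implementation.
def slr_hypotheses(slr, map_icons, current_location="身處點"):
    if slr["kind"] == "direct":
        # Direct reference (full name or abbreviation): a single hypothesis.
        return {"NUM": slr.get("num"), "hypotheses": [slr.get("name", current_location)]}
    # Indirect (deictic) reference: filter icons by LOC_TYPE if specified,
    # otherwise keep every location currently shown on the map.
    if slr.get("loc_type"):
        cands = [i["name"] for i in map_icons if i["loc_type"] == slr["loc_type"]]
    else:
        cands = [i["name"] for i in map_icons]
    return {"NUM": slr.get("num"), "hypotheses": cands}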

A pen gesture can be a point, a circle and/or a stroke, all of which are ambiguous. The coordinates of each pen gesture are compared with the positional coordinates of the icons on the map. Interpretation of each instance of a pen gesture generates a ranked list of hypothesized locations in ascending order of distance from the instance. Icons located within 50 pixels from the point (or the endpoints of a stroke) are considered as candidate semantic interpretations of the gesture. Icons overlapping with the area of a circle are considered as candidate semantic interpretations of the gesture. An illustration of the procedure is shown in Figure 1.

Multimodal input
S: 我現在在北郵 我要到這四個大學 一共需要多少時間
   ("I am now at 北郵; how much time is needed in total to get to these four universities?")
P: | | | (multiple sequential strokes with 3 turning points)

Hypothesis lists of speech input
SLR1: ABBREVIATION=北郵
      北京郵電大學
SLR2: DEICTIC=這四個大學 ("these four universities")
      NUM=4
      LOC_TYPE=schools_and_public_libraries, subtype=university
      中國地質大學, 北京師範大學, 北京郵電大學, …… (all universities shown on the map)

Hypothesis lists of pen input (locations ranked by distance in pixels)
PenDown:    TYPE=stroke   北京郵電大學 -1;    西土城路 5.4;   ……
TurningPt1: TYPE=stroke   北京航空航天大學 -1; 北京航空館 5.0; ……
TurningPt2: TYPE=stroke   中國地質大學 1.9;    學院路 11.0;   ……
TurningPt3: TYPE=stroke   北京科技大學 0.6;    學院路 11.4
PenUp:      TYPE=stroke   北京醫科大學 -1;    北醫三院 7.02;  ……

FIGURE 1. An illustration of the procedure for hypothesis list generation in the speech and pen modalities respectively.
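
As a concrete sketch of the distance-based interpretation described above (not the actual implementation), icons within the 50-pixel radius can be collected and ranked as follows; the icon data structure is an assumption.

import math

# Sketch: rank map icons by pixel distance from a point (or a stroke endpoint /
# turning point); icons within max_dist pixels become candidate interpretations.
# The 'map_icons' structure is illustrative.
def interpret_point(point_xy, map_icons, max_dist=50.0):
    px, py = point_xy
    cands = [(icon["name"], math.hypot(icon["x"] - px, icon["y"] - py))
             for icon in map_icons]
    return sorted([c for c in cands if c[1] <= max_dist], key=lambda c: c[1])

# Sketch: icons overlapping the circled area are candidates (a simple
# bounding-box overlap test stands in for the true overlap computation).
def interpret_circle(circle_bbox, map_icons):
    x0, y0, x1, y1 = circle_bbox
    return [icon["name"] for icon in map_icons
            if x0 <= icon["x"] <= x1 and y0 <= icon["y"] <= y1]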

The partial semantics derived from each modality were disambiguated using the Viterbi alignment [16] with a scoring function that enforces the temporal ordering between the sequence of SLRs and the sequence of pen gestures. The scoring function also enforces the semantic compatibility in terms of the numeric (NUM) and location type (LOC_TYPE) features. The algorithm generates the optimal path in aligning the sequence of SLR hypothesis lists with the sequence of pen gesture hypothesis lists in a multimodal input. The joint integration procedure extracts the highest-ranking location(s) from each pair of hypothesis lists so as to identify the user's intended location. The number of locations extracted follows the specified numeric feature.
follows the numeric feature specified.

We applied the cross-modality integration procedure to both the manually transcribed (perfect) training and test sets. For each multimodal inquiry, we manually annotate the alignment between an SLR and a pen gesture. If the oracle and system-generated alignments completely agree with each other, the multimodal inquiry is considered correct. The cross-modality integration procedure generated correct alignments between SLRs and pen gestures for 97.5% of training inquiries and 97.1% of the testing inquiries that contain SLR(s). Further details are presented in [1].

IV. HYPOTHESES RESCORING FOR ROBUSTNESS TOWARDS IMPERFECT TRANSCRIPTIONS
We attempt to extend the cross-modality integration procedure with the use of multiple recognition hypotheses in order to achieve robustness towards recognition errors. Consider the scenario in which a speech recognizer generates N-best hypotheses based on the speech input, while the pen gesture recognizer generates M-best hypotheses based on the pen input. The hypotheses are rank ordered according to their recognition scores in each individual modality. As such, we will have N × M possible candidates for cross-modality integration.
.

In designing a rescoring mechanism for comparing these candidates for integration, we should consider such elements as the quality of the recognized SLRs, the interpreted pen gestures and the alignment. We will elaborate on these points in the following subsections.

IV.1 Pruning and Scoring the Recognized Spoken Inputs

Under practical situations, captured inputs are much more problematic, due to disfluencies in the speech modality (e.g. filled pauses and repairs), spurious pen gestures and recognition errors in both modalities. These imperfections have adverse effects on cross-modality integration.

A. Transcribing the Spoken Inputs


We transcribed the speech signals in the multimodal corpus with a Mandarin speech recognizer [17] that was originally trained with speech data from a general open domain. Hence, we replaced the recognizer's general-domain lexicon and language model with a domain-specific version. Speech recognition performance evaluated based on the top-scoring recognition hypotheses gave an overall character accuracy of 44.6%. The performance is poor due to the accented Mandarin collected and background noise. Application of the SLR extraction procedure to the top-scoring recognition hypotheses shows substitution, deletion and insertion errors in the SLRs. SLR deletions and substitutions are the most prominent, frequently caused by the short duration of 這兒 /zher/ and the phonetic confusion between /zhe/ and /che/. Overall, the SLR recognition accuracies (each SLR is treated as a word) for the training and test sets are 38.5% and 39.3% respectively. In other words, over half of the SLRs have not been correctly extracted. However, the majority (>60%) of the incorrectly recognized SLRs involve confusion with other SLRs carrying the same semantic meaning and hence will not affect the subsequent cross-modality integration process. Overall, 50.9% and 51.7% of the recognized SLRs in the training and test sets were interpreted with correct semantics.

B. Pruning and Scoring the Spoken Inputs


The speech recognizer may generate nonsensical hypotheses in the N-best hypothesis list. A recognition transcript with a small perplexity value is more likely to have a reasonable interpretation. Hence our pruning strategy targets the opposite cases: hypotheses with large perplexity values exceeding a preset threshold are filtered.
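
A minimal sketch of this pruning step, assuming each N-best hypothesis already carries a language-model perplexity value; the threshold shown is an illustrative placeholder, not a value from this work.

# Sketch: drop N-best hypotheses whose language-model perplexity exceeds a preset
# threshold. The threshold value here is an illustrative placeholder.
def prune_by_perplexity(nbest, threshold=300.0):
    return [hyp for hyp in nbest if hyp["perplexity"] <= threshold]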


The speech component of a multimodal input expression may be transcribed by speech recognition as a hypothesized word sequence with R spoken locative references (SLRs). For a segment of the speech signal with specific start and end times, we may observe transcriptions across the N-best (N=100 in this work) speech recognition hypotheses. Let H_r^S denote the r-th SLR in one of the speech recognition hypotheses, which is also the transcription of a specific speech signal segment. We may score the quality of this transcription by defining the normalized cost C_S(H_r^S, N) for the recognized SLR H_r^S as:

C_S(H_r^S, N) = 1 - n(H_r^S, N) / N ..........(1)
where n(H_r^S, N) is the number of times the speech segment is transcribed as H_r^S across the N-best speech recognition hypotheses (N=100). n(H_r^S, N)/N is known as the N-best purity of the SLR H_r^S, where purity values range between 0 and 1. The higher the purity, the more preferable the SLR H_r^S, and the lower the normalized cost of the speech transcription C_S(H_r^S, N).
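
A short sketch of Equation 1, counting how often the same speech segment is transcribed as a given SLR across the N-best list and turning that purity into a cost (data structures are illustrative):

# Sketch of Equation (1): N-best purity of a recognized SLR and its normalized cost.
# 'segment_transcriptions' holds the transcription of this speech segment in each
# of the N-best hypotheses.
def slr_cost(slr_text, segment_transcriptions, n=100):
    purity = segment_transcriptions.count(slr_text) / float(n)   # n(H_r^S, N) / N
    return 1.0 - purity                                          # C_S(H_r^S, N)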

IV.2 Filtering and Scoring the Recognized Pen Inputs

A. Filtering and Recognizing the Pen Inputs

We have designed a filtering mechanism, with reference to the time and distance between two gestures, to remove repetitions. We have also developed a pen gesture recognizer, based on a simple algorithm that proceeds through a sequential procedure of recognizing a point, a circle and a stroke. This simple pen gesture recognition algorithm can only generate a single output hypothesis. The overall pen gesture recognition accuracy is 86.6%. Among the incorrectly recognized pen gestures, many involve confusions that carry the same semantic meaning, in which case the pen recognition errors will not affect the subsequent integration process. Overall, 91.3% of the recognized pen gestures can be interpreted with the correct semantic meaning.
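
The paper does not spell out the recognition rules, so the following is only a plausible sketch of such a sequential point/circle/stroke decision over an ink trace; the thresholds are assumptions.

import math

# Hedged sketch of a sequential point -> circle -> stroke decision over an ink
# trace (a list of (x, y) samples). Thresholds are illustrative assumptions.
def classify_gesture(trace, point_radius=10.0, close_radius=25.0):
    xs, ys = zip(*trace)
    extent = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    if extent <= point_radius:                    # tiny extent -> point
        return "point"
    (x0, y0), (x1, y1) = trace[0], trace[-1]
    if math.hypot(x1 - x0, y1 - y0) <= close_radius:
        return "circle"                           # trace closes on itself -> circle
    return "stroke"                               # otherwise a stroke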

B. Rescoring the Pen Inputs

A multimodal input expression may be transcribed as a sequence of Q pen gestures, each with a recognized pen gesture type. Each is interpreted as a list of hypothesized locations, i.e. H_q^P for the q-th pen gesture in the input expression. The interpretations are based on locations on the map that lie within a maximum distance d_max from the coordinates of the pen gesture and are rank ordered based on these distances d_{k,q,m}, where k indexes the hypothesized locations in H_q^P and may range from 1 to K_q, and m indexes the recognized pen gesture types (M=1 in the current work). To score a particular interpretation H_q^P[j] in the hypothesized list, we define the normalized cost of interpretation for the pen modality C_P(H_q^P[j], M) as shown in Equation 2. The smaller the distance d_{j,q}, the lower the normalized cost C_P(H_q^P[j], M) and the more preferable the interpretation for the pen gesture. The normalized costs of the K_q hypothesized locations in H_q^P will sum to 1.
C_P(H_q^P[j], M) = d_{j,q,m} / \sum_{m=1}^{M} \sum_{k=1}^{K_q} d_{k,q,m} ..........(2)
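
For M = 1 this reduces to dividing each distance by the sum of all K_q distances, as in the following sketch:

# Sketch of Equation (2) with M = 1: normalized interpretation cost for the j-th
# hypothesized location of a pen gesture, given its ranked distances in pixels.
def pen_cost(distances, j):
    return distances[j] / float(sum(distances))   # C_P(H_q^P[j], M)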

IV.3 Pruning and Scoring Cross-Modality Integrations


The cross-modality integration procedure incorporates a simple cost function for the Viterbi algorithm that penalizes mismatches in the LOC_TYPE and NUM features. In handling the imperfect N-best speech recognition and M-best pen recognition outputs, we need to enforce tighter constraints on semantic compatibility. We have established that direct and indirect references should be semantically redundant with the corresponding pen gestures [18]. Hence we prune the candidate integrations which involve mismatches in locations between interpreted pen gestures and direct references in speech.
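
A minimal sketch of this pruning check, assuming each aligned pair exposes the spoken reference and the locations hypothesized for its pen gesture (the structures are illustrative):

# Sketch: prune a candidate integration if a direct spoken reference and its
# aligned pen gesture point to different locations.
def survives_pruning(aligned_pairs):
    for slr, gesture_locations in aligned_pairs:
        if slr["kind"] == "direct" and slr["name"] not in gesture_locations:
            return False
    return True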


Candidate integrations that survive the pruning mechanism will each have an integration score S_I, which is computed with the Viterbi alignment [1] based on the pair of hypothesis lists (H_R^S, H_Q^P). H_R^S is the hypothesized transcription of the speech input that contains R recognized spoken locative references. H_Q^P is the hypothesized transcription of the pen input that contains Q interpreted pen gestures. We define the normalized cost of integration C_I(H_R^S, H_Q^P) as shown in Equation 3. S_I^max is the maximum possible integration score that is empirically obtained from the training data.

C_I(H_R^S, H_Q^P) = S_I(H_R^S, H_Q^P) / S_I^max,  0 ≤ C_I(H_R^S, H_Q^P) ≤ 1 ..........(3)
IV.4 Rescoring Cross-Modality Integrations

We rescore the candidates with the following procedure:

1. For each candidate, we apply cross-modality integration to its pair of hypothesis lists (H_R^S, H_Q^P). Should these include incompatible semantics, the candidate is pruned. If the candidate survives, we compute its normalized cost of integration C_I(H_R^S, H_Q^P) based on Equation 3.

2. We focus on the hypothesized transcription of the pen input H_Q^P. For each of the Q interpreted pen gestures (indexed by q), we select the interpretation j_q that is semantically compatible with its aligned SLR and compute the normalized cost of pen interpretation C_P(H_q^P[j_q], M) (see Equation 2). Should there be multiple semantically compatible interpretations, their normalized costs are summed. The overall cost of interpreted pen gestures for H_Q^P is defined as:

C_P(H_Q^P) = (1/Q) \sum_{q=1}^{Q} C_P(H_q^P[j_q], M) ..........(4)

3. We focus on the hypothesized transcription of the speech input H_R^S. For each of the R recognized SLRs (indexed by r), we compute its normalized cost of recognized SLR, i.e. C_S(H_r^S, N) (see Equation 1), which is derived from the N-best purity. The overall cost of recognized SLRs for H_R^S is defined as:

C_S(H_R^S) = (1/R) \sum_{r=1}^{R} C_S(H_r^S, N) ..........(5)

4. The rescoring function that is used to evaluate each candidate for cross-modality integration is a linear combination of the three normalized cost functions, i.e.

C_Tot(H_R^S, H_Q^P) = w_I C_I(H_R^S, H_Q^P) + w_P C_P(H_Q^P) + w_S C_S(H_R^S) ..........(6)

where 0 ≤ w_I, w_P, w_S ≤ 1 and w_I + w_P + w_S = 1.

We select values for the weights, w_I=0.5, w_P=0.35 and w_S=0.15, by grid search to maximize the cross-modality alignment accuracies based on the training data. All candidates for cross-modality integration are rescored according to Equation 6 and re-ranked in ascending order of scores. The candidate with the minimum overall cost C_Tot(H_R^S, H_Q^P) is identified as the preferred cross-modality alignment.
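
Putting the pieces together, the following hedged sketch mirrors the rescoring loop over the N × M candidates; the component costs are assumed to be computed as in Equations 3 to 5, and the weights are those reported above.

# Sketch of the overall rescoring (Equations 3-6). Component costs are assumed to
# be computed as described above; weights follow the grid search result.
def total_cost(s_i, s_i_max, pen_costs, slr_costs, w_i=0.5, w_p=0.35, w_s=0.15):
    c_i = s_i / s_i_max                                   # Equation (3)
    c_p = sum(pen_costs) / len(pen_costs)                 # Equation (4)
    c_s = sum(slr_costs) / len(slr_costs)                 # Equation (5)
    return w_i * c_i + w_p * c_p + w_s * c_s              # Equation (6)

def best_candidate(candidates):
    """candidates: dicts with keys 's_i', 's_i_max', 'pen_costs', 'slr_costs'
    (illustrative field names). Returns the candidate with minimum C_Tot."""
    return min(candidates, key=lambda c: total_cost(
        c["s_i"], c["s_i_max"], c["pen_costs"], c["slr_costs"]))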

IV.5 Evaluating the Rescoring Procedure



The application of the rescoring procedure to the candidate hypotheses for cross-modality integration has brought improvements to the alignment accuracies in the training and test sets of our multimodal corpus, as shown in Table 1. The improvements in integration accuracies brought about by cross-modality hypothesis rescoring are statistically significant. Further analysis of our results (see Table 2) shows that there can be correct cross-modality integration despite recognition errors in the speech and/or pen modalities. The N-best hypothesis rescoring framework can effectively re-rank the hypothesis pairs to obtain correct integrations, as illustrated by the example in Table 3.

Analysis of the incorrect interpretations found two main causes: the lack of timing information, and the handling of an unspecified numeric feature (e.g. 這裡 "here" has NUM=nil and can be aligned with any number of pen gesture instances without penalty in the alignment). Incorporating timing information can help to reduce the association between an SLR and a pen gesture that are far apart in time. Generating a specific numeric feature can provide a more specific alignment cost.

TABLE 1. Performance of the cross-modality integration (in terms of % of correctly aligned expressions) in the training and test sets.

                                                                       Training    Test
# expressions                                                             957       425
Cross-modality integration of oracle transcriptions in both
modalities, based on the Viterbi alignment                               97.1%     97.5%
Cross-modality integration of the top-scoring speech recognition
hypothesis and recognized pen inputs, based on the Viterbi alignment     53.7%     54.8%
Top candidate obtained after cross-modality integration and
rescoring of the N-best (N=100) speech recognition outputs with
the first-best recognized pen input hypotheses                           67.5%     69.9%

TABLE 2. Detailed performance statistics of the test set. Rows partition the test inquiries by whether the pen gestures (Pen) and the SLRs were correctly recognized. Column A: correct integration with the top-scoring hypotheses from each modality. Column B: correct integration with N-best (N=100) speech recognition hypotheses and M-best (M=1) pen recognition hypotheses.

Pen       SLR       # inquiries in the test set     A                  B
                    (425 in total)
correct   correct   96/425  (22.6%)                 96/96  (100%)      96/96   (100%)
correct   error     256/425 (60.2%)                 92/256 (35.9%)     152/256 (59.4%)
error     correct   40/425  (9.4%)                  31/40  (77.5%)     32/40   (80%)
error     error     33/425  (7.8%)                  14/33  (42.4%)     17/33   (51.5%)
Overall                                             54.8%              69.9%

TABLE 3. Example of correct integration in the presence of SLR and pen recognition errors.

Reference transcriptions:
S: 這個公園 什麼時候 開放 ("When does this park open?")
P: (a big point within the icon of a park)

Top-scoring speech and pen recognition hypotheses:
S: 這兒 公園 什麼時候 開放 (這個公園 "this park" misrecognized as 這兒 公園 "here, park")
P: (a circle within the icon of a park)

Remark: Although the numeric and location type features are missed in the recognized SLR and the point is mistaken as a circle by the pen gesture recognizer, the framework can integrate the two modalities correctly and identify the park indicated by the user.

V. CONCLUSIONS AND FUTURE WORK

We devised a cross-modality semantic integration procedure to align input events in the speech modality with those in the pen modality using the Viterbi algorithm. We designed and collected a multimodal corpus (with frequent SLRs) in the domain of city navigation to support our investigation. The speech and pen modalities have been transcribed by hand. The overall speech character recognition and pen gesture type recognition accuracies are 44.6% and 89.9% respectively. Application of cross-modality integration to manual transcriptions generated correct unimodal paraphrases for over 97% of the training and test sets. However, if we replace these with the top-scoring speech and pen recognition transcripts, the performance drops to around 54%. In order to achieve robustness towards imperfect transcripts, we extend our framework with a hypothesis rescoring procedure. For each multimodal expression, this procedure considers all candidates for cross-modality integration based on the N-best (N=100) speech recognition hypotheses and the M-best (M=1) pen input recognition hypotheses (each with Q generated location hypotheses). Rescoring combines such elements as the integration scores obtained from the Viterbi algorithm, the N-best purity of recognized SLRs, as well as distances between the coordinates of recognized pen gestures and relevant icons on the map. Experiments using the N-best (N=100) speech recognition hypotheses and the top-scoring (M=1) pen recognition hypotheses show that the rescoring and re-ranking helped improve the test set performance of correct cross-modality interpretation to 69.9%. Future work includes the investigation of cross-modality timing information to aid semantic interpretation and the handling of ellipsis in the speech component of a multimodal expression.

REFERENCES

[1] Hui, P.Y. and H. Meng, "Joint Interpretation of Input Speech and Pen Gestures for Multimodal Human-Computer Interaction," in the Proc. of Interspeech, 2006.
[2] Hauptmann, A. G., "Speech and Gestures for Graphic Image Manipulation," in the Proc. of CHI, 1989.
[3] Nigay, L. and J. Coutaz, "A Generic Platform for Addressing the Multimodal Challenge," in the Proc. of CHI, 1995.
[4] Wang, S., "A Multimodal Galaxy-based Geographic System," S.M. Thesis, MIT, 2003.
[5] Johnston, M. et al., "Unification-based Multimodal Integration," in the Proc. of COLING-ACL, 1997.
[6] Johnston, M., "Unification-based Multimodal Parsing," in the Proc. of COLING-ACL, 1998.
[7] Wu, L. et al., "Multimodal Integration – A Statistical View," IEEE Transactions on Multimedia, 1(4), pp. 334-341, 1999.
[8] Wahlster, W. et al., SmartKom (www.smartkom.org).
[9] Johnston, M. and S. Bangalore, "Finite-state Multimodal Parsing and Understanding," in the Proc. of COLING, 2000.
[10] Chai, J., "A Probabilistic Approach to Reference Resolution in Multimodal User Interfaces," in the Proc. of IUI, 2004.
[11] Chai, J. et al., "Optimization in Multimodal Interpretation," in the Proc. of ACL, 2004.
[12] Qu, S. and J. Chai, "Salience Modeling based on Non-verbal Modalities for Spoken Language Understanding," in the Proc. of ICMI, 2006.
[13] Chan, S.F. and H. Meng, "Interdependencies among Dialog Acts, Task Goals and Discourse Inheritance in Mixed-Initiative Dialog," in the Proc. of HLT, 2002.
[14] Wu, Z.Y., H. Meng, H. Ning and C.F. Tse, "A Corpus-based Approach for Cooperative Response Generation in a Dialog System," in the Proc. of ISCSLP, 2006.
[15] Meng, H. and D. Li, "Multilingual Spoken Dialog Systems," in Multilingual Speech Processing, Academic Press, 2006.
[16] Brown, P. et al., "The Mathematics of Statistical Machine Translation: Parameter Estimation," Computational Linguistics, 19(2):263-311, 1993.
[17] Chang, E. et al., "Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones," in the Proc. of ICSLP, 2000.
[18] Hui, P.Y. et al., "Complementarity and Redundancy in Multimodal User Inputs with Speech and Pen Gestures," in the Proc. of Interspeech, 2007.