"Humanizing Human Computer Interactions"

linksnewsΤεχνίτη Νοημοσύνη και Ρομποτική

18 Οκτ 2013 (πριν από 3 χρόνια και 9 μήνες)

128 εμφανίσεις

Shweta Purushe

Final paper

Human Computer Interactions


"Humanizing Human Computer Interactions"

How close can we bring humans and computers together? The more we bridge the gap between the two, the more efficient user interfaces will become, the faster systems will respond, and the more exceptional situations human-computer interactions will be able to handle. Current systems depend on algorithms and fixed mathematical models. If, however, systems could recognize and mimic human emotion in addition to exhibiting human-like intelligence, designs and implementations would become far more effective, culminating in the long-standing challenge of full automation. The emotional and physiological states of the user are especially important in fields such as medicine, health care and security provision. This paper looks into the prospects of developing such systems and the exciting yet daunting challenges being faced.


INTRODUCTION

Computers can solve mathematical, algorithmic and other complex problems across many disciplines with mind-boggling accuracy and efficiency. A computer can outperform a person of extraordinarily high IQ at such tasks. But do we consider the computer intelligent? Adapting, learning and making decisions to solve problems are tasks a machine is now capable of, yet in affective state recognition computers are still at a fledgling stage. The ability to express, recognize, react to and regulate affective states is a recent addition by psychologists and sociologists to the existing, narrow definition of intelligence; this addition is aptly called 'emotional intelligence'.


Although few, there are certain avenues where human-computer interaction would clearly benefit from machines that can recognize the emotive and affective state of the user and react to it: for example, personal prescriptions for minor health ailments, or helping police and psychiatrists understand the psychological condition of a criminal.

If I were an air traffic controller, or the pilot myself, I would really appreciate my machine recognizing when I am about to doze off or when I might feel sick during the flight, and perhaps even switching to autopilot in good weather after I have had a long, tiring day. It might even be as simple as having a friendly computer to celebrate Christmas with when one is away from home.

Having a computer recognize one's affective and emotional state and suggest remedies where needed is always desirable.

Why is this so difficult?

A number of factors have to be understood, drawing on disciplines as different as psychology, sociology, anthropology, neuroscience and computer science [1].



The main problem encountered is the definition of the affective states and their arousal itself. How are the emotional states of a human, and the psychological issues underlying them, to be defined to a system that is meant to recognize them automatically in the user? [1]

Human emotional intelligence depends on a number of modalities such as audio-visual inputs, sensory inputs, impulses, conditioned reflexes and so on. Mimicking these along with their tight coupling, without treating them as mutually independent, is close to impossible.

Certain reactions, emotional cues and the very psychology of people depend on their cultures and their different styles of communication [1, 2].

Context complicates the issue further. How is a machine to tell whether the user frowns momentarily in an attempt to concentrate or frowns to express anger or disapproval? [1, 2]

Much of the attention in such studies goes to the modalities (sound, touch and sight) and to non-verbal communication (voice intonation, facial expression and so on). Physiological signals, however, such as a quickened heart rate, a febrile condition or perspiration due to apprehension, are inputs that require contact with the user. Most monitoring systems are uncomfortable, intrusive or very fragile, although they do have the added advantage of less noise and higher accuracy [1, 2, 3].



These are the general problems that need to be solved to bring computers closer to humans, to design better and easier user interfaces, and to shape man-machine interactions into a multimodal automatic system. To gear up to design such a system, we must consider the modalities currently in use, identify their flaws and try to surpass them.


The figure below shows the different levels of information processing involved in marrying different modalities to achieve the kind of human-computer interaction described above.

Figure 1: Different levels of data needed to design a multimodal system [2]


Limitations of the Visual modality

On the grounds of the bi-dimensional emotion model put forth by researchers, facial expression recognition is extremely important. It can be approached in either of two ways:

Feature based: recognizing individual features of the face such as the corners of the mouth, the eyebrows and the muscles of the jaw [2].

Region based: measuring the motion of particular regions of the face, the most prominent being the eyebrows/eyes region and the cheeks/corners of the mouth region [1, 2].
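To make the region-based idea concrete, here is a minimal sketch, not one of the cited systems: it measures the average optical-flow motion inside two hand-picked face regions between consecutive video frames. The region coordinates and the choice of OpenCV's Farneback optical flow are my own assumptions.

# Minimal sketch (not the cited systems): region-based motion measurement
# using dense optical flow over hand-picked face regions. The region
# coordinates below are placeholders and would normally come from a
# face/landmark detector.
import cv2
import numpy as np

REGIONS = {                      # (x, y, width, height) -- assumed values
    "eyebrows_eyes": (60, 40, 120, 40),
    "mouth_cheeks": (70, 130, 100, 50),
}

def region_motion(prev_frame, next_frame):
    """Return the mean optical-flow magnitude inside each face region."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)          # per-pixel motion
    return {name: float(magnitude[y:y + h, x:x + w].mean())
            for name, (x, y, w, h) in REGIONS.items()}

A sequence of such per-region motion values over time is the kind of signal a region-based recognizer would then classify.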

However, the approaches and techniques designed so far have been trained on a limited number of pre-decided, dramatic prototypes. Only six major emotions (joy, anger, fear, surprise, happiness and sadness) are used to train these systems [1], which I find very inadequate. One study used short video clips of an actress emoting these expressions [2]. I feel that training such systems requires more subtle guidelines; this makes the task much more difficult, but it will be far more realistic. Does an everyday user emote as dramatically as that actress did in those video clips? I certainly think not.

This is in addition to the problems of context dependency mentioned before, and these approaches also fail to handle the temporal attribute of the data. How is the machine to discern the attitude or temperament of the user?


Figure 2: Exaggerated emotions are being used to train systems for facial expression recognition [2]







Limitations of the Auditory modality

Recognition of emotion from audio inputs makes use of 'prosodic' information. Prosody has been defined as the pitch, duration and intensity of the utterance [2].
The scenario is similar when voice inputs are used to train systems. Once again only six emotions are considered when detecting changes in lexical stress, rhythm and intonation. As with facial expression recognition, the auditory approaches do not account for context, classify inputs into too narrow a range of emotions and do not handle the temporal aspect [1, 2]. In addition, the training data consist of very short sentences spoken in a much too exaggerated manner.

And how are slurred speech, impaired speech and the complete absence of speech to be addressed?

State of the Art

Of course, the situation is not as dire as the above might suggest. Some interesting work is being done, although very few laboratories are undertaking research into multimodal systems. One particular attempt that I really like is the work of Lawrence S. Chen and Thomas S. Huang [4]. Far from a full multimodal system, this study couples only the modalities of sight and sound, putting forth a simple algorithm to integrate audio and visual data.


For the audio data they calculated a simple pitch contour: the speech signal was broken into analysis windows and a number of pitch parameters were calculated for each window.

For the video data, optical flow over the mouth and eye regions, the Fourier transform and Hidden Markov Models were used, as sketched below.
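To make the Hidden Markov Model step concrete, here is a minimal sketch of one plausible formulation, not the authors' actual implementation: one Gaussian HMM per emotion is trained on sequences of per-frame facial-motion features, and a new sequence is assigned to the emotion whose model scores it highest. The hmmlearn library and the feature layout are my own assumptions.

# Minimal sketch (my assumption, not the authors' implementation): one
# Gaussian HMM per emotion over sequences of per-frame facial-motion features.
import numpy as np
from hmmlearn import hmm

EMOTIONS = ["happy", "angry", "surprise", "fear", "sad", "dislike"]

def train_emotion_hmms(train_seqs, n_states=4):
    """train_seqs: dict emotion -> list of (frames, features) arrays."""
    models = {}
    for emotion in EMOTIONS:
        seqs = train_seqs[emotion]
        X = np.vstack(seqs)                     # stack all frames
        lengths = [len(s) for s in seqs]        # frames per sequence
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=50, random_state=0)
        m.fit(X, lengths)
        models[emotion] = m
    return models

def classify_sequence(models, seq):
    """Return the emotion whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda e: models[e].score(seq))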




This was the algorithm they put forth [4].




At the weighting matrix, the relative probabilities that each modality assigns to each emotion are combined, and the emotion with the highest combined score is taken as the recognized emotion.
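A minimal sketch of that late-fusion step, assuming each modality has already produced a probability for every emotion; the weight values below are illustrative, not taken from the paper.

# Minimal late-fusion sketch (illustrative weights, not the paper's values):
# combine per-modality emotion probabilities through a weighting matrix and
# pick the emotion with the highest combined score.
import numpy as np

EMOTIONS = ["happy", "angry", "surprise", "fear", "sad", "dislike"]

# weights[modality][emotion]: how much each modality is trusted per emotion.
WEIGHTS = np.array([
    [0.6, 0.6, 0.5, 0.5, 0.3, 0.3],   # audio
    [0.4, 0.4, 0.5, 0.5, 0.7, 0.7],   # video
])

def fuse(audio_probs, video_probs):
    """audio_probs, video_probs: arrays of length 6, one entry per emotion."""
    scores = (WEIGHTS[0] * np.asarray(audio_probs)
              + WEIGHTS[1] * np.asarray(video_probs))
    return EMOTIONS[int(np.argmax(scores))], scores

emotion, scores = fuse([0.1, 0.1, 0.1, 0.1, 0.4, 0.2],
                       [0.05, 0.05, 0.1, 0.1, 0.2, 0.5])
print(emotion)   # -> "dislike" in this toy example

In this toy example the video channel gets more weight on sadness and dislike, reflecting the finding below that video is better at telling those two apart, while audio gets more weight on the emotions it separates well.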

Although their attempt was commendable, confusion arose in several places. For example, in the audio data (the pitch graph, Figure 4) it is possible to distinguish happy, angry, surprise and fear from sad and dislike, but difficult to distinguish sad from dislike; in the video data (Figure 3) it was difficult to differentiate dislike from anger.



Figure 3: Different emotions in videos being used to train the system [4]

Figure 4: Graph of the pitch of every emotion from audio inputs [4]




This confusion also suggests that humans in the real world do not rely on a single piece of information but combine auditory and visual (and other) inputs to communicate.

Along similar lines, it could be argued from this research that the pitch graph (Figure 4) lets us separate sadness and dislike from the other emotions, and that to differentiate between these two we could then rely on the video data: as Figure 3 indicates, video distinguishes sadness from dislike.



However, the speaker and actor were Spanish; whether the results apply to different cultures and populations is ignored in this study.

Studying just two modalities involved such difficulty that efficient multimodal systems remain a daunting challenge.

Nonetheless, there has definitely been progress. A brief description follows of a system used by the military. It makes use of the following agents and scales from a handheld device to a wall-sized interface, interoperating across multiple platforms [5]:

An interface
A natural language agent
A speech recognition agent
A text-to-speech agent
A gesture recognition agent
A simulation agent
A multimodal integration agent
A web display agent

As the names suggest, these agents interplay to produce a system that has been used in medical services, allowing users to combine gestures and speech to find the required health care providers. The multimodal integration agent translates such a request into a query against a database of doctors, and the corresponding icons are displayed on a map for the user to choose from [5].
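Purely as my own toy illustration of the integration idea, not QuickSet's actual architecture, the sketch below merges a speech-derived intent with a gesture-derived map location into a single structured query.

# Toy sketch (my own illustration, not QuickSet's architecture): merge a
# speech-derived intent with a gesture-derived map location into one query.
from dataclasses import dataclass

@dataclass
class SpeechInput:
    specialty: str            # e.g. parsed from "show me cardiologists"

@dataclass
class GestureInput:
    lat: float                # point or region the user indicated on the map
    lon: float
    radius_km: float

def integrate(speech: SpeechInput, gesture: GestureInput) -> dict:
    """Combine the two modalities into a single structured database query."""
    return {
        "table": "doctors",
        "specialty": speech.specialty,
        "near": {"lat": gesture.lat, "lon": gesture.lon},
        "radius_km": gesture.radius_km,
    }

query = integrate(SpeechInput("cardiology"),
                  GestureInput(lat=40.44, lon=-79.94, radius_km=5.0))
print(query)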

Although this is significant progress, affective state recognition of the user is still missing; that is the direction in which human-computer interaction research should move.






CONCLUSION

Automatic multimodal systems are simple to imagine, but research has shown that achieving them will be rather difficult. During communication, humans perceive several modalities, contexts and body movements, and the reaction is the result of integrated processing of all this input. I therefore feel that a message is not conveyed by voice or facial expression alone but by a combination of these, and that attributing the contribution of each participating modality to the final reaction or response deserves more attention. Humans detect a cue or message from their communicators as a whole, not from any one modality independently.

Humanizing human-computer interactions may also introduce privacy and dependency issues. Devising such systems must therefore also entertain the idea of their being non-intrusive and non-threatening to humans.

At work, blaming any flaws on the machine will become tempting and simple. One might argue with the machine, but in the end, who wins, man or machine?

References:

[1] Maja Pantic and Leon J. M. Rothkrantz, "Toward an Affect-Sensitive Multimodal Human-Computer Interaction", Proceedings of the IEEE, Vol. 91, No. 9, September 2003.

[2] Jaimes, A. and Sebe, N., "Multimodal Human Computer Interaction: A Survey", Computer Vision and Image Understanding, Vol. 108, Issues 1-2, October 2007.

[3] Bartlett, Marian Stewart; Littlewort, Gwen; Fasel, Ian; Movellan, Javier R., "Real Time Face Detection and Facial Expression Recognition: Development and Applications to Human Computer Interaction", Computer Vision and Pattern Recognition Workshop, 2003 (CVPRW '03), June 2003.

[4] Chen, L. S.; Huang, T. S.; Miyasato, T.; Nakatsu, R., "Multimodal human emotion/expression recognition", Automatic Face and Gesture Recognition, 1998, Proceedings, Third IEEE International Conference, pp. 366-371, 1998.

[5] Cohen et al., "QuickSet: multimodal interaction for distributed applications", Proceedings of the Fifth ACM International Conference on Multimedia, 1997.