"Humanizing Human Computer Interactions"

18 Oct 2013

Shweta Purushe

Final paper

Human Computer

How close can we bring humans to computers? The more we bridge the dissimilarity between humans and computers, the more efficient user interfaces will become, the faster system processing will become, and the more exceptional situations human-computer interactions will be able to manage. Current systems depend on algorithms and certain mathematical models. However, if systems could mimic human emotion in addition to extant human intelligence, designs and implementations would become more efficient. This should culminate in the final challenge of full automation. The emotional and physiological states of the user are extremely important, especially in the fields of medicine, health care and security provision. This paper looks into the prospects of developing such fantastic systems and the exciting yet daunting challenges that are being faced.


Computers have been able to solve mathematical, algorithmic and several complex problems in different disciplines with mind-boggling accuracy and efficiency. A computer could perform far better than a person of extraordinarily high IQ. However, do we consider the computer intelligent? Adapting, learning and taking decisions to solve problems are tasks a machine is now capable of doing. But affective state recognition is an avenue in which computers are still in a fledgling phase. The ability to express, recognize, react to and regulate affective states is a recent addition to the current narrow definition of intelligence by psychologists and sociologists. This addition is aptly called ‘emotional intelligence’.

Although few, there are certain avenues where human-computer interactions would definitely benefit from machines developing the ability to recognize the emotive and affective status of the user and to react to it. Examples include personal prescriptions for minor health ailments, or helping police and psychiatrists understand the psychological condition of a criminal.

If I were in air traffic control, or were I a pilot myself, I would really appreciate my machine recognizing when I am about to doze off, or when I might feel sick during the flight. Better still, it could automatically switch to autopilot in good weather if I had had a long, tiring day.

It could even mean having a friendly computer to celebrate Christmas with when one is away from home.
Having a computer recognize one’s affective and emotional state and suggest remedies, if needed, is
always desirable.

Why is this so difficult?

There are a number of factors that have to be understood from disciplines as different as psychology, anthropology, neuroscience and computer science.


The main problem encountered is the definition of affective states and their arousal. How are the emotional states of a human and the psychological issues underlying them to be defined for a system that automatically recognizes these in the user?

Human emotional intelligence is dependent on a number of modalities such as audio inputs, sensory inputs, impulses, conditioned reflexes etc. Trying to mimic these along with their tight coupling, without making them mutually independent, is close to impossible.

Certain reactions, emotional cues and the mere psychologies of people are dependent on the

and their different communications

[1, 2]

Context complicates the issue further. How is a machine to differentiate between a user who frowns momentarily in an attempt to concentrate and one who frowns to express anger? [1, 2]

A lot of the attention during such studies is on the modalities (sound, touch and sight) and non-verbal communications (voice intonations, facial expression etc). However, physiological signals, for example a quickened heart rate, a febrile condition or perspiration due to apprehension, are inputs that require contact with the user. Most monitoring systems are uncomfortable, intrusive or very fragile. However, they do have the added advantage of less noise and higher

[1, 2 and 3]

This outlines the general problems that need to be solved to bring computers closer to humans and to design better and easier user interfaces and man-machine interactions that help us define a multimodal automatic system.

In order to gear up to design such a system we must consider the currently used approaches, identify their flaws and try to surpass them.

Below is a figure of the different levels of information processing for marrying different modalities to achieve the ultimate zenith of human-computer interaction described before.

Figure 1: Different levels of data to design a multimodal system


Limitations of the Visual modality

On the grounds of a bi-dimensional emotion model put forth by researchers, facial expression recognition is extremely important. It could be

Feature based: recognizing different features of the face such as the corners of the mouth, the eyebrows and the muscles of the jaw.

Region based: measurement of the motions of particular regions of the face, the most prominent being the eyebrows/eyes region and the cheek/corners of the mouth region [1, 2]

However, the approaches and techniques designed so far have been trained on a limited number of predecided, dramatic prototypes. Only six major emotions (among them fear, surprise, happiness and sadness) are being used to train these systems. I find these very inadequate. One study used short video clips in which an actress emoted these expressions. I feel that training systems requires more subtle guidelines; this does make the task much more difficult, but it will be far more realistic. Does an everyday user emote as dramatically as that actress did in those video clips? I certainly think not.

This is in addition to the existing problem of context dependency mentioned before; nor do these techniques handle the temporal attribute of the data. How does the machine discern the attitude or temperament of the user?


Exaggerated emotions are being used to train systems for facial expression recognition


Limitations of the Auditory modality


Recognition of emotion from audio inputs makes use of ‘prosodic’ information. Prosody has been defined as the pitch, duration and intensity of the utterance.

The scenario is similar for the auditory inputs used to train voice recognition systems. Once again only six emotions are being considered for detecting changes in lexical stress, rhythm and intonation. Similar to facial expression recognition, auditory analysis is not context sensitive, classifies inputs into too narrow a range of emotions and does not handle the temporal aspect [1, 2]. In addition, the training data contains very short sentences spoken in much too exaggerated a manner. How are questions of slurred speech, impaired speech and loss of speech (users who cannot speak) to be addressed?

State of the Art

Of course, the situation is not as dire as the above might indicate. Some interesting work is being done, although very few laboratories are undertaking research in multimodal systems. One particular attempt that I really like is the work of Lawrence S. Chen and Thomas S. Huang. Far from being a fully multimodal system, this study aims to couple only the modalities of sight and sound.

They put forth a simple algorithm to integrate both audio and visual data.

They calculated a simple pitch contour for audio data. The speech signal was broken into analysis windows and a number of pitch parameters were calculated for each window.
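The windowing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the autocorrelation pitch estimator, the window and hop sizes, the 75 to 500 Hz search range and the summary statistics are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math


def frame_signal(samples, window_size, hop_size):
    """Split a signal into fixed-size, possibly overlapping analysis windows."""
    last_start = len(samples) - window_size
    return [samples[s:s + window_size] for s in range(0, last_start + 1, hop_size)]


def estimate_pitch(window, sample_rate, min_hz=75.0, max_hz=500.0):
    """Estimate the pitch of one window with a plain autocorrelation search.

    Returns 0.0 when no lag in the search range has a positive correlation
    (for example, for silence). The search range is an assumed speech range.
    """
    min_lag = max(1, int(sample_rate / max_hz))
    max_lag = min(int(sample_rate / min_hz), len(window) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(window[i] * window[i + lag] for i in range(len(window) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def pitch_contour(samples, sample_rate, window_size=400, hop_size=200):
    """Return one pitch estimate (in Hz) per analysis window."""
    windows = frame_signal(samples, window_size, hop_size)
    return [estimate_pitch(w, sample_rate) for w in windows]


def summarize_contour(contour):
    """Compute a few illustrative pitch parameters over the voiced windows."""
    voiced = [p for p in contour if p > 0.0]
    if not voiced:
        return {"mean": 0.0, "min": 0.0, "max": 0.0, "range": 0.0}
    return {
        "mean": sum(voiced) / len(voiced),
        "min": min(voiced),
        "max": max(voiced),
        "range": max(voiced) - min(voiced),
    }


# A synthetic 200 Hz tone stands in for a recorded utterance.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]
print(summarize_contour(pitch_contour(tone, sample_rate)))
```

Real speech would need a more robust estimator and a voiced/unvoiced decision; the sketch only shows how a contour arises from per-window estimates.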

For video data, optical flow for the mouth and eye regions, the Fourier Transform and Hidden Markov Models were used.

This was the algorithm they put forth


At the weighting matrix, the outputs for each of the emotions are combined, and the emotion having the highest final deciding number is the one recognized.
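The combination step can be sketched as a weighted sum of per-emotion scores from each modality, followed by picking the emotion with the highest total. This is only an illustration of the idea: the score values, the uniform weights and the function names are assumptions, not figures or code from the study. The six emotion labels are the ones named in this paper's discussion of the results.

```python
EMOTIONS = ("happy", "sad", "angry", "dislike", "surprise", "fear")


def fuse(scores_by_modality, weights_by_modality):
    """Combine per-emotion scores from several modalities by a weighted sum."""
    fused = {emotion: 0.0 for emotion in EMOTIONS}
    for modality, scores in scores_by_modality.items():
        weights = weights_by_modality[modality]
        for emotion in EMOTIONS:
            fused[emotion] += weights[emotion] * scores[emotion]
    return fused


def recognize(fused):
    """Return the emotion with the highest fused score."""
    return max(fused, key=fused.get)


# Hypothetical scores: audio alone cannot separate sad from dislike,
# while video leans towards dislike.
scores = {
    "audio": {"happy": 0.05, "sad": 0.40, "angry": 0.05,
              "dislike": 0.40, "surprise": 0.05, "fear": 0.05},
    "video": {"happy": 0.05, "sad": 0.10, "angry": 0.30,
              "dislike": 0.45, "surprise": 0.05, "fear": 0.05},
}
# Uniform weights are an assumption; a real system would tune one weight
# per modality and emotion.
weights = {m: {e: 0.5 for e in EMOTIONS} for m in scores}

print(recognize(fuse(scores, weights)))
```

Keeping one weight per modality and emotion lets a system trust audio more for emotions it separates well and video more for the others.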

Although their attempt was commendable, issues of confusion arose in many areas. For example, in the figures below, for the audio data it is possible to distinguish happy, angry, surprise and fear from sad and dislike, but difficult to distinguish between sad and dislike. For the video data, it was difficult to differentiate between dislike and anger.

Figure 3: Different emotions in videos being used to train the system

Figure 4: Graph of the pitch of every emotion from audio inputs


This confusion also suggests that humans, in the real world, do not rely on a single piece of information but combine their inputs from the auditory and visual modalities (and others) to communicate.

On similar lines, it could be argued from this research that from the pitch graph (Figure 4) we could differentiate sadness and dislike from the others. To further differentiate between these two, we could rely on the data that the videos provide. As indicated in Figure 3, we could distinguish sadness from dislike.

However, the speaker and the actor were Spanish, so the question of whether the approach is applicable to different cultures and populations is ignored in this study.

Studying just two modalities involved this much difficulty; efficient multimodal systems are therefore a daunting challenge.


There has definitely been progress, however. A brief description of one such system follows.

This system makes use of many of the following components and is scalable right from a hand-held device to a wall-sized interface that interoperates between multiple platforms


An interface

Natural Language agent

A speech recognition agent

speech agent

A gesture recognition agent

Simulation agent

A multimodal integration agent

Web display

As the names describe, all these agents interplay to produce a system that has been used in medical services and allows users to use gestures and speech to find the required health care.

This request is translated by the multimodal integration system into a query against a database of doctors. Corresponding icons are displayed on the map, allowing choice and selection.
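The translation of a combined speech and gesture request into a database query can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the system's actual integration agent: the record fields, the circular map region, the in-memory list standing in for the database and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doctor:
    name: str
    specialty: str
    x: float  # map coordinates (hypothetical units)
    y: float


@dataclass
class SpeechInput:
    specialty: str  # assumed to be already parsed from the spoken request


@dataclass
class GestureInput:
    center_x: float  # a region circled on the map
    center_y: float
    radius: float


def integrate(speech, gesture):
    """Merge the two inputs into one query: what to find, and where."""
    return {
        "specialty": speech.specialty,
        "region": (gesture.center_x, gesture.center_y, gesture.radius),
    }


def run_query(query, doctors):
    """Return doctors of the requested specialty inside the circled region."""
    cx, cy, radius = query["region"]
    return [
        d for d in doctors
        if d.specialty == query["specialty"]
        and (d.x - cx) ** 2 + (d.y - cy) ** 2 <= radius ** 2
    ]


# Hypothetical records standing in for the database of doctors.
doctors = [
    Doctor("Dr. A", "cardiology", 1.0, 1.0),
    Doctor("Dr. B", "cardiology", 9.0, 9.0),
    Doctor("Dr. C", "dermatology", 1.5, 0.5),
]
query = integrate(SpeechInput("cardiology"), GestureInput(0.0, 0.0, 3.0))
for doctor in run_query(query, doctors):
    print(doctor.name)  # each match would become an icon on the map
```

Neither input is sufficient alone: speech supplies what is wanted and the gesture supplies where, which is the point of integrating the modalities.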


Although this is significant progress, affective state recognition of the user seems to be missing. That is the direction in which research on human-computer interaction should move.


Automatic multimodal systems are extremely simple to fathom, but research has shown that achieving this goal is going to be rather difficult. During human communication, humans perceive several modalities, contexts and body movements. However, the reaction is the result of an integrated processing of all this input. Therefore I feel that a certain message is not conveyed by voice or facial expression alone, but by a combination of these. Attributing the contribution of each of the participating modalities to the final reaction or response is something that I feel deserves more attention. Humans detect a certain cue or message from their communicators, and not a particular modality independently.

Humanizing human-computer interactions may also introduce certain privacy and dependency issues. Devising such systems must also entertain the idea of these systems being non-intrusive and non-threatening to humans.

During work, blaming any flaws on the machine will become tempting and simple. One might argue with the machine, but finally who wins, man or machine?


[1] Maja Pantic, Leon J. M. Rothkrantz. “Toward an Affect-Sensitive Multimodal Human-Computer Interaction.” Proceedings of the IEEE, Vol. 91, No. 9, September 2003

[2] Jaimes A., Sebe N. “Multimodal Human Computer Interaction: A Survey.” Computer Vision and Image Understanding, Volume 108, Issues 1-2, October 2007

[3] Bartlett, Marian Stewart; Littlewort, Gwen; Fasel, Ian; Movellan, Javier R. “Real Time Face Detection and Facial Expression Recognition: Development and Applications to Human Computer Interaction.” Computer Vision and Pattern Recognition Workshop, 2003. CVPRW '03. June 2003

[4] Chen, L.S.; Huang, T.S.; Miyasato, T.; Nakatsu, R. “Multimodal human emotion/expression recognition.” Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on

[5] Cohen et al. “QuickSet: multimodal interaction for distributed applications.” Proceedings of the fifth ACM international conference on Multimedia