The Emotion Mirror: Recognizing Emotion in Speech and Displaying the Emotion in an Avatar

Abe Kazemzadeh, Samuel Kim, and Yoonji Kim

Final Project for CSCI 534: Affective Computing

5/2/07

Profs. Gratch and Marsella




Abstract


This paper describes the Emotion Mirror, a demonstration that recognizes emotion in a user's utterance and displays the emotion through the facial expressions of an animated avatar that repeats what the user has said. The demonstration is discussed in terms of system architecture and component design. In addition to the technical details of the system, we also look at general engineering and theoretical issues that pertain to the demo system. Finally, we describe our experience in creating the system to highlight what worked and what did not.



Introduction


Our project aimed to develop an integrated demonstration of several affective computing technologies. The theme of an emotional mirror served to tie the separate technologies into a coherent interaction. The emotion in the user's spoken utterance is "reflected" back as the facial emotion of an animated avatar that speaks back what the user said. Since human emotion often operates at a subconscious level, it is hoped that the Emotion Mirror may eventually be used to help people reflect upon their emotions. The intuition behind such an application comes from two sources. First, it has been postulated that there is a point in a child's development when they recognize their own reflection in a mirror. Second, many people find it funny or unusual to hear their recorded voice played back.



The Emotion Mirror demonstration consists of a speech and acoustic emotion recognizer, a text-based emotion recognizer that operates on the speech recognition output, and a facial animation system that controls the face and lip synchronization. Figure 1 below shows a diagram of the overall system architecture. One goal of our design was to reuse as much existing software and methodology as possible, both to ensure that our project was feasible in a semester time frame and to focus our efforts on the emotion aspects of the design. The details of each component are discussed in the following sections and cited in the reference section. After that we look at engineering and theoretical issues that pertain to this demonstration.

Figure 1. The input to the system is the user's spoken utterance, which is assumed to be of unconstrained vocabulary. The lexical and emotional content are recognized and the wave file is saved. The lexical output of this stage (text) is classified into emotional categories based on word distribution statistics and lexical resources, and a final decision can be made considering the acoustic emotion recognition results (this is not implemented yet, and the ideal way to combine these components is left to future work). The resulting emotional decision is sent with the text and wave file to the face, which adjusts the facial expression and synchronizes the lips with the wave file. The lip gestures are generated using the phone sequence from the TTS module. All of the components are separate processes on the same machine connected by TCP, but the face, face control, and TTS module are connected by an API provided by the CSLU toolkit that hides the TCP implementation details. For more information about the separate components, see the following sections and the references for proper citations.




[Figure 1 diagram: the user's speech feeds a Speech Recognition and Acoustic Emotion Detection module, which sends the text, the acoustic emotion (neutral/negative), and the wave file name over TCP to a Text Emotion Classification and Acoustic+Text Decision module; that module sends the text with emotion annotation, the emotion classification (angry/disgusted/fearful/happy/neutral/sad/surprised), and the wave file name over TCP to the Face Control and GUI, which exchanges text, lip-sync information, emotion, and the wave file with the face and the TTS module (Festival) over the CSLU toolkit API/TCP.]
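
As the diagram indicates, the components run as separate processes connected by TCP (with the face, face control, and TTS hidden behind the CSLU toolkit API). The minimal Python sketch below shows how the text-emotion stage might receive a message from the recognizer and forward its decision to the face control; the port numbers, the "|"-delimited message format, and the helper names are assumptions for illustration, not the demo's actual protocol.

```python
import socket

# Hypothetical port numbers and message format; the demo's real values differ.
RECOGNIZER_PORT = 5000     # where the speech/acoustic stage sends its result
FACE_CONTROL_PORT = 5001   # where the face control process listens

def classify_text_emotion(text, acoustic_label):
    # Placeholder: the TFIDF-based classifier described below would go here.
    # Until a text decision is combined in, fall back on the acoustic label.
    return acoustic_label

def serve_forever():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("localhost", RECOGNIZER_PORT))
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        # One message per utterance: "text|acoustic_emotion|wave_file" (assumed format)
        text, acoustic_label, wave_file = conn.recv(4096).decode().split("|")
        emotion = classify_text_emotion(text, acoustic_label)
        # Forward the decision to the face control, which animates and lip-syncs
        with socket.create_connection(("localhost", FACE_CONTROL_PORT)) as out:
            out.sendall(f"{text}|{emotion}|{wave_file}".encode())
        conn.close()
```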

Speech and Acoustic Emotion Recognition

In the proposed real-time emotion detection system, we extract and fuse emotional information encoded at different timescales, specifically the supra-frame, intra-frame, and lexical levels. The rationale behind this approach is that emotion is encoded at each of these levels of speech and that the features at the different timescales are complementary.


There have been many studies that classify emotional states using each of these levels of features. In this work, we first extract the emotion encoded at the supra- and intra-frame levels using acoustic information and later combine it with lexical information. MFCCs and statistics of prosody information (pitch and energy) are used for extracting emotion at the intra- and supra-frame levels, respectively. The output of the speech recognizer is used for extracting emotion at the lexical level.
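
The real-time system extracts these features in C++ with IT++ and Torch; as a rough illustration of the two acoustic timescales, the Python sketch below (using librosa, which is not part of the original system) computes frame-level MFCCs and utterance-level prosody statistics from a saved wave file.

```python
import numpy as np
import librosa

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)

    # Intra-frame timescale: spectral envelope of each frame (MFCCs)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

    # Supra-frame timescale: statistics of prosody (pitch and energy) over the utterance
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # frame-wise pitch estimate
    energy = librosa.feature.rms(y=y)[0]                 # frame-wise energy
    prosody_stats = np.array([
        np.nanmean(f0), np.nanstd(f0),                   # pitch mean / variability
        energy.mean(), energy.std(),                     # energy mean / variability
    ])
    return mfcc, prosody_stats
```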


The emotion estimates extracted from the two acoustic feature streams need to be combined in order to reach a single decision. Here we used a modified version of a weighted likelihood sum; details are described in [Kim, 2007]. For the real-time application, we rely on several libraries, including IT++ and Torch [IT++][TORCH]. Prior to extracting textual emotion, speech recognition is required to produce a word sequence. The SONIC speech recognizer, based on Hidden Markov Models (HMMs), is used for this purpose [SONIC]; it was developed for continuous speech recognition at the University of Colorado, Boulder.
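
The exact rule is the modified weighted likelihood sum of [Kim, 2007], which we do not reproduce here; the sketch below shows only the basic weighted combination of per-class scores from the two acoustic streams that such a rule builds on, with the stream weight left as a free parameter.

```python
import numpy as np

def fuse_loglikelihoods(ll_intra, ll_supra, w=0.5):
    """Basic weighted sum of per-class log-likelihoods from the intra-frame
    (MFCC) and supra-frame (prosody) classifiers; w weights the intra-frame
    stream. [Kim, 2007] describes the modified version actually used."""
    combined = w * np.asarray(ll_intra) + (1.0 - w) * np.asarray(ll_supra)
    return int(np.argmax(combined))  # index of the winning emotion class

# Example with two classes (neutral, negative); here the neutral class wins.
winner = fuse_loglikelihoods([-2.0, -3.0], [-2.5, -3.5], w=0.6)
```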



Textual Emotion Recognition


The benefit of analyzing the emotion in the text of the speech recognizer output is that it is possible to use aspects of an utterance's meaning. The approach used here is a shallow, lexical one that extracts only as much meaning as is needed to capture emotional information. Deeper ways of analyzing the meaning of an utterance could provide a richer description of the utterance's emotion, but there is a trade-off between depth of analysis and breadth of coverage. Since we wanted to allow unconstrained user input, we opted for a shallow analysis with broad coverage.


The first method we tried was Cynthia Whissell's Dictionary of Affect in Language (DAL) (Whissell 1986, Whissell 1989). This dictionary provides a table of 9000 words (the references quote 4000 words, but the dictionary Whissell provided to us contains almost 9000 entries) with values for three dimensions: valence, activation, and imagery (the references we cite make no mention of the imagery dimension), which are normalized to a range of 1 to 3. The dictionary was compiled from ratings provided by experimental subjects and evaluated in repeat experiments for validity. Some attested uses are evaluating the emotional tone of texts, authorship attribution, and discourse analysis. One limitation noted in (Whissell 1989) is that the activation dimension has lower reliability. This is interesting because the acoustic emotion recognition shows the opposite tendency, where valence is less reliable. Therefore, these two modalities have the potential to complement each other. However, we did not end up using the features from the DAL in our demonstration, because the main inputs to the face's emotion controls were categorical emotions. In future work, it may be possible to control the face at the word level instead of the sentence level. With this increase in resolution, it may be possible to use the DAL features to control emotions at the word level.
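
Although the DAL features did not make it into the final demo, word-level use of the dictionary would look roughly like the following sketch; the tab-separated file layout is an assumption for illustration and may not match the format of the dictionary Whissell provided.

```python
def load_dal(path):
    """Assumed format: word<TAB>valence<TAB>activation<TAB>imagery, values in [1, 3]."""
    dal = {}
    with open(path) as f:
        for line in f:
            word, valence, activation, imagery = line.rstrip("\n").split("\t")
            dal[word.lower()] = (float(valence), float(activation), float(imagery))
    return dal

def utterance_affect(text, dal):
    """Average the valence/activation/imagery of the words found in the dictionary."""
    hits = [dal[w] for w in text.lower().split() if w in dal]
    if not hits:
        return None  # no covered words, no estimate
    n = len(hits)
    return tuple(sum(dim) / n for dim in zip(*hits))
```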


The second method we tried was a document retrieval technique known as TFIDF (term frequency * inverse document frequency) (Salton and Buckley 1988). This approach uses the intuition that words (terms) that occur frequently in a document but infrequently across documents are relevant terms for queries. The frequency of a term in a document, term frequency, is captured by the tf component. The infrequency of a term's occurrence across documents, inverse document frequency, is captured by the idf component. Our use of this approach extended the metaphor of a document query to emotion categorization by treating the utterance's lexical content as the query and an ordered list of relevant emotions as the documents returned. To do this, we created lists of hypothetical emotional utterances in documents for each of the seven emotions accepted by the face (angry, disgusted, fearful, happy, neutral, sad, surprised). This approach gave us initial data, but it is still very sparse and does not provide a rigorous basis for the categorization. With more data, it is reasonable to assume that performance will improve by virtue of greater coverage as well as by making n-gram features possible; currently only unigrams are used. One way we circumvented the lack of data was by using the stems of words, rather than the word tokens, as features for the TFIDF analysis. The stemming was performed using the WordNet resource and is implemented by a Perl library available on CPAN (Pedersen and Banerjee 2007). This reduced the number of words for which statistics must be maintained, but slowed down performance, especially in the word statistics learning phase. Another method, the Porter stemming algorithm, offered better speed but used a more naive method. Since the slowness of the WordNet stemming was less noticeable in the querying phase, we chose it over the Porter stemming algorithm. However, more empirical evidence is needed to advocate one over the other.
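
A minimal sketch of this document-retrieval formulation is shown below: each of the seven emotions is one "document" made of example sentences, the user's utterance is the query, and the emotions are ranked by TFIDF-weighted similarity. For brevity the sketch uses scikit-learn's TfidfVectorizer rather than the Perl/WordNet stemming pipeline the demo actually used, so the stemming behaviour and weighting details differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

EMOTIONS = ["angry", "disgusted", "fearful", "happy", "neutral", "sad", "surprised"]

def build_emotion_index(emotion_docs):
    """emotion_docs: list of 7 strings, each the concatenated example sentences
    for one emotion, in the same order as EMOTIONS."""
    vectorizer = TfidfVectorizer()                # unigram features, as in the demo
    doc_matrix = vectorizer.fit_transform(emotion_docs)
    return vectorizer, doc_matrix

def classify_utterance(text, vectorizer, doc_matrix):
    query = vectorizer.transform([text])
    scores = cosine_similarity(query, doc_matrix)[0]
    return sorted(zip(EMOTIONS, scores), key=lambda p: -p[1])  # ranked (emotion, score)
```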


Another tradeoff in the TFIDF approach is the representation of emotions as documents. In our demonstration, the documents were lists of approximately 100 sentences. An alternative would be to consider each sentence a document. The tradeoff can be seen in the two terms of the TFIDF equation. When using the first method, one emotion per document, the estimation of idf becomes very discrete, since it can take only one of eight values (0/7, 1/7, 2/7, ... 7/7). However, when using the second method, one document per sentence, the tf term becomes very discrete instead: because the number of words per sentence is relatively small, we get the same problem of a small denominator, this time in the tf term. One possible way to get around this would be to use multiple neutral documents. This would take advantage of the fact that more neutral data is available, so the estimation of idf would improve while the documents remain large enough to estimate tf.
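
To make the discreteness argument concrete, the textbook tf-idf weighting (not necessarily the exact variant of Salton and Buckley used in our code) is shown below.

```latex
% tf-idf weight of term t in document d, with N documents in the collection
% and df_t the number of documents containing t:
\[
  w_{t,d} \;=\; tf_{t,d} \,\times\, \log\frac{N}{df_t}
\]
% With one document per emotion, N = 7, so df_t \in \{0, 1, \dots, 7\} (zero
% counts need smoothing in practice) and the idf factor can take only a handful
% of distinct values; with one document per example sentence, N is large but
% tf_{t,d} is estimated over very short documents, so tf becomes coarse instead.
```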


One way we want to improve the textual emotion recognition is by getting more data, both for use in the emotional documents and for an empirical evaluation of this approach. With more data in the emotional documents, higher-order n-grams could be used, and with an empirical evaluation we could make meaningful assertions about the strengths and weaknesses of this methodology. One possible way to get more data is to use the demonstration to gather user data and learn from this input. Another opportunity for furthering this research is to work towards better integration with other modalities. Our approach was geared towards the CSLU face; with other modalities we might have considered other emotional representations, perhaps more along the lines of the DAL. More data would also help here by allowing us to determine optimal ways to combine the textual and acoustic emotion recognition components. The ultimate improvement to textual emotion recognition would be true understanding of semantics, context, and agent reasoning, combined with a dynamic emotion model, so beyond our initial approach there are ample opportunities for further work.



Facial Animation of Spoken Emotions


For the facial expression, discrete emotion theory holds that there is a small number (six) of basic emotions (Ekman; Izard; Tomkins). The six emotions are angry, disgusted, fearful, happy, sad, and surprised. Neutral is added in the CSLU toolkit to represent the default state and is used only for the default expression. In this theory, each basic emotion is genetically determined, universal, and discrete, and in this project we follow the same view.


We first get the emotion from the speech recognizer and then decide the final emotion through text analysis; the result is the input to our program. The input consists of the sentence text, the emotion, and the audio file, passed as a text file. We read this file and apply the animation. The difference from the initial system is that the emotion is displayed in real time for every utterance, and the lip synchronization is performed automatically, without the user explicitly entering the text.
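
As described above, the face-control program reads its input (the sentence text, the chosen emotion, and the audio file name) from a text file. The sketch below illustrates one plausible layout, one field per line; the actual layout and field order used with the CSLU toolkit in our demo may differ, so treat this format as an assumption.

```python
# Hypothetical one-field-per-line layout; the real file layout used with the
# CSLU toolkit face control may differ.
def write_face_input(path, text, emotion, wave_file):
    with open(path, "w") as f:
        f.write(text + "\n")
        f.write(emotion + "\n")      # one of the seven categories accepted by the face
        f.write(wave_file + "\n")    # audio played back while the lips are synchronized

def read_face_input(path):
    with open(path) as f:
        text, emotion, wave_file = (line.rstrip("\n") for line in f)
    return text, emotion, wave_file
```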





Engineering and Theoretical Issues


The overarching theoretical issue that subsumes the Emotion Mirror demonstration is the representation and modeling of human emotion. Closely related to this is the engineering issue of measuring, categorizing, and synthesizing emotions. It may be possible to measure, categorize, and synthesize emotions in an ad hoc way, but without a theoretical motivation such applications will not have sufficient generality. Without taking theoretical considerations into account, applications will not be able to generalize across different input and output modalities, interpersonal idiosyncrasies, situational context, and cross-cultural differences. Conversely, without reliable and precise ways of measuring, categorizing, and synthesizing emotions, theoretical issues cannot be empirically tested. One of the important themes of affective computing is the complementary nature of theoretical and engineering issues.


The way our demonstration tackled the theoretical issue of representing emotions was by using different emotional representations for different modalities in a principled way. Research has shown [reference] that emotions expressed in speech contain reliable information about the activation dimension of emotion, but are less reliable predictors of valence. Therefore, in previous studies, neutral and sad emotions are reliably differentiated from happy and angry, but within these groups the acoustic measurements provide less discrimination. The lexical and semantic aspects of language contain information that is useful for capturing fine detail about emotions, from the different connotations of similar words and through descriptions of emotion-causing situations. However, the richness of language can cause ambiguities and blur distinctions between emotional categories. Facial expression of emotion has the benefit of providing clear emotional categories that a human subject can reliably identify. However, automatic recognition of facial emotions can be confounded by different camera angles, individual physiognomy, and concurrent speech.



By combining acoustic and lexical information, we made it possible to get both activation measures and categorization. Moreover, a technical problem in speech recognition is that emotional speech differs from neutral speech and can cause recognition errors. Using both acoustic and lexical modalities could circumvent this problem if unreliable recognition could be detected, perhaps by a language model score. Also, the fact that we used the speaker's own utterance for the animated character allowed this error to be minimized to lip-sync errors. The TFIDF approach provided a mapping between the expression of emotion in an utterance's meaning and the categorical input to the facial animation.


Another theoretical issue of this demonstration is the psychological idea of a mirror that provides introspection and self-awareness. Since emotions are often subconscious [reference], such an application may have a therapeutic use in helping people "get in touch with" their emotions. Hypothetically, this could increase emotional expressiveness in shy people and decrease it in obnoxious or hysterical people. Whether it could actually be such a psychological panacea is a question that needs more work. Related to this are the theoretical notions of philosophy of mind and other-agent reasoning, which deal with the question of how people understand the actions and motives of others. It is an open question whether people understand other agents' actions by putting themselves in the other agents' shoes or by some type of causal reasoning about psychological states. Experiments using variations on the emotion mirror could provide insight into this issue.


Some possible applications of the technology behind this demonstration include automatic animation, actor training, call center training, therapy for introspection or interpersonal skills, indexing multimedia content, and human-centered computing. For automatic animation, this technology would provide both lip sync and emotion sync and could save animators time; corrections and extra artistic effects could modify or add to the automatically generated emotions. Actors could use this demonstration to practice their lines, which would let them see the emotions they are actually conveying and could make practicing more fun. The same holds for patients who have trouble conveying emotion, whether they convey too little emotion, as in the case of shy people, or too much, as in the case of hysterical or hot-tempered people. Shy people might benefit from attempting to act emotional around a computer, by which they might be less intimidated than by a human friend or psychiatrist. People who over-display emotion may benefit from seeing their emotion displayed back to them, making them realize how they appear to others. (Tate and Zabinski 2003) report on the clinical prospect of using computers in general for psychological treatment, and there is of course the famous ELIZA program, realized today in the Emacs Psychiatrist (Weizenbaum 1966).


Finally, as with many affective computing applications, there can be ethical concerns with this demonstration. It has been noted that users may be more abusive toward virtual characters than toward real people. Furthermore, to enable the demonstration to recognize anger, it may be necessary to have the system recognize swear words and racial epithets. The law of unintended consequences may be realized in many unpleasant ways.

References



[SONIC] http://cslr.colorado.edu/beginweb/speech_recognition/sonic.html

[Kim, 2007] S. Kim, P. Georgiou, S. Lee, and S. Narayanan, "Real-time Emotion Detection System using Speech: Multi-modal Fusion of Different Timescale Features," submitted to MMSP 2007.

[IT++] http://itpp.sourceforge.com/

[TORCH] http://www.torch.ch/

[CSLU] http://cslu.cse.ogi.edu/toolkit/

Ekman, P. (1972). Universals and cultural differences in facial expression of emotion. In J. R. Cole (Ed.), Nebraska Symposium on Motivation: Vol. 19 (pp. 207-283). Lincoln: University of Nebraska Press.

Ekman, P. (1984). Expression and the nature of emotion. In K. R. Scherer & P. Ekman (Eds.), Approaches to emotion (pp. 319-344). Hillsdale, NJ: Erlbaum.

Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6(3-4), 169-200.

Izard, C. E. (1971). The face of emotion. New York: Appleton-Century-Crofts.

Tomkins, S. S. (1984). Affect theory. In K. R. Scherer & P. Ekman (Eds.), Approaches to emotion (pp. 163-196). Hillsdale, NJ: Erlbaum.

Whissell, Cynthia, "The Dictionary of Affect in Language" in Emotion: Theory, Research, and Experience, v. 4. Eds. Robert Plutchik and Henry Kellerman. Academic Press, 1989.

Whissell, C. M., Fournier, M., Pelland, R., Wier, D., and Makarec, K. "A Dictionary of Affect in Language: Reliability, validity, and applications." Perceptual and Motor Skills, 62, 875-888. 1986.

Salton, G. and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5): 513-523.


Fellbaum, Christiane (Ed.), WordNet: An Electronic Lexical Database. MIT Press, 1998.


Ted Pedersen and Satanjeev Banerjee, http://search.cpan.org/~sid/WordNet-Similarity-1.04/lib/stem.pm, accessed Mar. 2007.

Jason Rennie, http://search.cpan.org/~jrennie/WordNet-QueryData-1.45/QueryData.pm, accessed Mar. 2007. (required by stem.pm)

Ted Pedersen, Siddharth Patwardhan, Jason Michelizzi, and Satanjeev Banerjee, http://search.cpan.org/~sid/WordNet-Similarity-1.04/lib/WordNet/Similarity.pm, accessed Mar. 2007. (required by stem.pm)

Benjamin Franz and Jim Richardson, http://search.cpan.org/~snowhare/Lingua-Stem-0.82/lib/Lingua/Stem.pod, accessed Mar. 2007.


Deborah F. Tate and Marion F. Zabinski. "Computer and Internet applications for psychological treatment: Update for clinicians." Journal of Clinical Psychology, Volume 60, Issue 2, Pages 209-220. 2003.


Joseph Weizenbaum. ELIZA--A Computer Program For the Study of Natural Language Communication Between Man and Machine. Communications of the ACM, Volume 9, Number 1 (January 1966): 36-45.