Emotion Recognition Through Body Language for Human Robot Interaction

The Maersk Mc-Kinney Moller Institute
Faculty of Engineering
University of Southern Denmark
handed in by
Lilita Kiforenko
born: 29.03.1989
10.06.2013
Supervisor: Dirk Kraft
Contents

Contents 2
List of Figures 4
List of Tables 6
1 Introduction 7
1.1 Problem Statement and Solution 7
1.2 Report Outline 8
2 State of the Art 9
2.1 Emotions in Psychology 9
2.2 Automatic Solutions 11
3 Prerequisites 15
3.1 Microsoft Kinect For Windows 15
3.1.1 Kinect sensor 15
3.1.2 Microsoft Kinect SDK 1.6 16
3.1.3 Real-Time Human Body Joint Acquisition 16
3.2 Sequence Classification 18
3.2.1 Dynamic Time Warping 19
3.2.2 Support Vector Machines 20
3.3 Data Visualization: Self-Organising Map 24
4 System Description 25
4.1 Data Acquisition 26
4.1.1 Emotion Database 26
4.1.2 Recording System 28
4.1.3 Skeleton Improvement 29
4.1.4 Time Series Segmentation 31
4.1.5 Sequence Labelling 32
4.2 Data Preprocessing 35
4.2.1 Equalizing the Length of the Emotion Videos 35
4.2.2 Meta-Feature Extraction 35
4.3 Data Classification 40
5 Experiments and Analysis 41
5.1 Recorded Data — Observation/Evaluation 41
5.2 Dynamic Time Warping versus Simple Video Stretching 48
5.2.1 Raw Data Result Discussion 49
5.2.2 Meta-feature Result Discussion 49
5.3 Support Vector Machines Parameter Selection 50
5.3.1 Raw Data Result Discussion 51
5.3.2 Meta-Feature Result Discussion 53
5.4 Results 54
5.5 Data Observation 61
6 Conclusion 64
A Appendix: Algorithms 67
B Appendix: Attached DVD Content 71
Bibliography 72
List of Figures

2.1 Robert Plutchik's wheel of emotions and 10 postulates 11
3.1 Kinect hardware 16
3.2 Joint positions and abbreviations 17
3.3 Joint hierarchy 17
3.4 Depth image features 18
3.5 Global constraints 20
3.6 An example of problem definition 21
3.7 Different size margin example 21
3.8 A maximum-margin hyperplane 22
3.9 Class separation problems 23
4.1 The process sequence 25
4.2 Project set-up 26
4.3 Recording system initial window 29
4.4 Turtle game screenshot 30
4.5 Hand position improving sequence diagram 31
4.6 Hand position improvement examples 32
4.7 Possible emotion cut example 33
4.8 Collected data 34
4.9 DTW result cost matrix 36
4.10 DTW and SVS result 36
4.11 Joint combinations 39
5.1 Boredom expressed 43
5.2 Joy expressed 44
5.3 Curiosity expressed 45
5.4 Confused expressed 46
5.5 Disgust expressed 47
5.6 Video length equalizing method evaluation 48
5.7 SVM parameter selection flow 50
5.8 Polynomial kernel optimal parameter search results 52
5.9 RBF kernel optimal parameter search results 52
5.10 Meta-feature RBF kernel optimal parameter search results 53
5.11 Individual joint classification results 54
5.12 Individual feature from Posture Group classification results 55
5.13 Individual feature from Limb Rotation Movement Group classification results 56
5.14 Individual feature from Posture Movement Group classification results 56
5.15 Overall results 58
5.16 Combined result confusion matrix 59
5.17 Emotion recognition result 59
5.18 Results for identifying each emotion 61
5.19 New person classification accuracy 62
5.20 Raw data U-matrix 63
List of Tables

2.1 A selection of lists of basic emotions 12
2.2 List of related works 14
4.1 Example input 35
4.2 Posture Group definition 38
4.3 Limb Rotation Movement Group definition 39
4.4 Posture Movement Group definition 39
5.1 DTW distance test results 48
5.2 DTW Sakoe-Chiba band test results 49
5.3 Emotional sequence length evaluations 50
5.4 Meta-feature SVM optimal parameter search results 53
5.5 Joint selection results 57
5.6 Meta-feature selection results 58
Emotion - from Old French esmovoir to excite, from Latin ēmovēre to disturb, from movēre to move. A psychological state that arises spontaneously rather than through conscious effort and is sometimes accompanied by physiological changes.
Collins English Dictionary
1 Introduction
Every year technology gets more powerful and cheaper to acquire. The era of mobile phones used only for calls and SMS is rapidly being replaced by smartphones that are becoming more intelligent and human oriented. The focus now is on creating technology that understands humans and is personalised to the end user. New smart TVs use a camera that allows easier communication between user and machine. I believe it will not be long until such TVs are able to "understand the user" and predict what he will like. To achieve this, studies of human mood and emotions are needed.
Human care technology is not a new area. Projects aiming to improve human care are increasing in number, and there is a need for care robots. The development of small robots is also increasing, from robot vacuum cleaners to grass cutters; they are present in almost every aspect of everyday life. There are also robot pets developed for human interaction, for example the AIBO dogs by Sony. Adding a human emotion recognition system to these robots would greatly improve their quality. Wouldn't it be wonderful to have a small iRobot vacuum cleaner that does not vacuum next to you when you are annoyed?
Emotion recognition is an important feature when developing communication between artificial systems and humans. Whether it is an avatar or a robotic platform, wherever there is a need to interact with humans, systems will benefit greatly if they can perceive human emotions.
1.1 Problem Statement and Solution
Human emotion studies are not a new area, but much is still unknown, and human emotions are not easy to use in everyday life with the level of technology we have today. Typically, human emotions are recognised using electroencephalography, the face and speech. Not much has been done in emotion recognition from body movements and postures. With cheap depth cameras on the market that are easily connected to our TVs or to any robotic device, the idea is to use these to recognise human emotions from body movements and postures. This is the main focus of this research.
The research hypotheses are:
1. Humans express emotions while experiencing different visual input.
2. Human body joint angles extracted using a Microsoft Kinect for Windows (in the following called Kinect) and the Microsoft Kinect Software Development Kit 1.6 contain enough information to classify human emotions.
This work presents an approach to emotion recognition using body movements and postures. The first problem that needs to be solved is how to acquire an emotional dataset. The proposed solution is to use the Kinect to record such a dataset, and to trigger real human emotions with different visual stimuli. The next problem is how to extract the emotion expressions from the entire recording. This can be done by analysing the quantity of movement; to do so, the joint positions of the human body need to be extracted, which is possible with existing libraries for the Kinect. After the emotion sequences are collected, the problem of sequence classification arises. The proposed solution is to use Support Vector Machines to classify the sequences, with joint rotation angles or specially calculated features as their input.
1.2 Report Outline
This report discusses the mentioned problems and possible solutions and provides results.
The structure of the report is as follows:
Chapter 1: Describes the problem, defines the hypotheses and presents the report outline.
Chapter 2: Focuses on emotion recognition: when the first attempts to identify emotions were made, what has been done in psychology on emotion recognition, and the proposed automatic solutions and their recognition success.
Chapter 3: Explains the theory that is needed to understand the proposed solutions.
Chapter 4: Describes the methods that have been applied to solve the problem.
Chapter 5: Shows the achieved results.
Chapter 6: Concludes the work and discusses the results.
2 State of the Art
This chapter discusses the milestones in emotion recognition, starting from facial emotion recognition. It explains different emotional models and which emotions tend to be classified. The chapter ends with an evaluation of the developed automatic solutions for emotion classification and their performance.
2.1 Emotions in Psychology
To my knowledge the earliest well-known work concerning human emotions was made by Charles Darwin and published in 1872. In this work Darwin mentions that distinguishing emotions has been an important topic for many scientists. He cites the physiologist Müller, who said that "...according to the kind of feeling excited, entirely different groups of the fibres of the facial nerve are acted on" [9]. This statement leads to the conclusion that it is possible to categorise different emotions based on human facial expressions.
Darwin writes that human emotions also depend on human habits. For example, when going out the door a person unconsciously puts on gloves. This is something that was taught to that person from childhood; it is his repeatable habit. Finding out whether that person acts this way because an "emotion" forces him to do so, or because he was "taught" to do exactly this movement, is an unreachable goal. That is why scientists search for basic (primary, true, fundamental; different literature uses different terms) emotions. The whole idea behind basic emotions is that if people around the world express the same emotion in the same way (we can find the same features), then these are instinctive emotions: people do not consciously control their muscles, so these are the ones we can categorise, and we have them to deal with fundamental life-tasks [11]. Darwin wrote "...whenever the same movement of the features or body express the same emotions in several distinct races of man, we may infer with much probability, that such expressions are true ones, that is, are innate or instinctive" [9]. Another type of action we need to think about while working with emotions, according to Darwin, are reflex actions, for example coughing, sneezing, or closing the eyelid when the eye gets touched. Some reflex actions are hard to distinguish from habits, and most of them are also performed unconsciously.
Other well-known research was started in 1968 by Paul Ekman, who tried to find out whether human facial behaviours associated with emotion are universal or culture specific. Like Darwin, Ekman searched for basic emotions and found six of them. He proposed nine characteristics that distinguish basic emotions, for example quick onset and brief duration. The six emotions are: anger, fear, enjoyment, sadness, disgust and surprise. Ekman does not dismiss the possibility of additional basic emotions; for example, Izard et al. [20] provided evidence that interest is also a basic emotion.
One of the characteristics that, according to Ekman et al. [11], distinguishes emotion is brief duration: he provides evidence that humans experience an emotion for seconds, not minutes or hours. This characteristic is also used to separate emotions from moods, which can last for hours. Scherer et al. [36] reported that emotions last between 5 seconds and several hours, but their results are based on asking the participants. Ekman et al. [11] suggested that this is not a valid method, because a person can experience the same emotion triggered multiple times.
Plutchik [32] proposed another emotional model that consists of 8 basic emotions, with other emotions occurring as combinations of the eight basic ones. He demonstrated it by creating a wheel of emotions and writing ten postulates about his psycho-evolutionary theory. It is shown in Figure 2.1.
Many researchers agree that there is a small set of basic emotions, but they have different opinions about its size and which emotions it contains [30]. Ortony and Turner [30] summarised the basic emotions proposed in different works, as shown in Table 2.1. There is a lot of disagreement between theorists, and some of the emotions have different names that mean the same thing, for example joy and happiness or fear and anxiety. Most lists include such emotions as anger (rage), joy (happiness), sadness, fear (anxiety), disgust and surprise. These emotions are the basis for my research.
Each emotion can also be expressed in multiple ways. Ekman and Friesen [14] proposed to consider basic emotions as families; for example, there can be a family of angry emotions. In their research they found 60 different ways to show anger, but all these expressions "...share certain core configuration properties" [12, p. 386].
The research on emotions in psychology is mostly based on facial expressions in still images. Nevertheless, there are multiple attempts to evaluate a sequence of actions; for example, Keltner et al. [23] show that there exists a sequence of movements for the embarrassment emotion, with a duration of 5 seconds. Ekman et al. [12] disagree that such sequences will be uniform and the same world-wide.
Figure 2.1: Robert Plutchik's wheel of emotions and 10 postulates, taken from [2].
In newer works psychologists are focusing more on voice and the human body. Several extensive studies have been conducted on the bodily expression of emotion.
"While some studies have found evidence for specific body movements accompanying specific emotions, other indicate that movement behaviour (aside from facial expression) may be only indicative of the quantity (intensity) of emotion, but not of it quality" [39]. In Ekman's and Friesen's [13] opinion, body movement expresses the intensity of the emotion, while other researchers have shown that there exist distinctive body movements or postures that help to recognise specific emotions [35, 13, 6]. Wallbott's [39] research focused on the analysis of body movements and postures with reference to specific emotions. He provided evidence that there are specific movements and postures associated with different emotions.
2.2 Automatic Solutions
Emotion recognition is not a new topic. Many attempts have been made to recognise emotions automatically, for example by recording or analysing in real time human facial expressions, body movements, postures or speech.
Electroencephalography (EEG) is also widely used in emotion recognition.
Reference and fundamental emotions:
Arnold (1960): anger, aversion, courage, dejection, desire, despair, fear, hate, hope, love, sadness
Ekman, Friesen, Ellsworth (1982): anger, disgust, fear, joy, sadness, surprise
Frijda (1986): desire, happiness, interest, surprise, wonder, sorrow
Gray (1982): rage and terror, anxiety, joy
Izard (1971): anger, contempt, disgust, distress, fear, guilt, interest, joy, shame, surprise
James (1884): fear, grief, love, rage
McDougall (1926): anger, disgust, elation, fear, subjection, tender-emotion, wonder
Mowrer (1960): pain, pleasure
Oatley, Johnson-Laird (1987): anger, disgust, anxiety, happiness, sadness
Panksepp (1982): expectancy, fear, rage, panic
Plutchik (1980): acceptance, anger, anticipation, disgust, joy, fear, sadness, surprise
Tomkins (1984): anger, interest, contempt, disgust, distress, fear, joy, shame, surprise
Watson (1930): fear, love, rage
Weiner, Graham (1984): happiness, sadness
Table 2.1: A selection of lists of basic emotions, taken from [30].
Using EEG, researchers could achieve 66.7% recognition for three emotional states (pleasant, neutral and unpleasant), and up to 78.04% for negative emotions such as sadness and disgust [34, 38]. A lot of research has been done using EEG and the results are optimistic. The problem is that even in the best scenario a specially made "sensor cap" has to be worn, which is not applicable in everyday life.
Most of the research in human emotion recognition is based on human facial expression and speech. Facial emotion recognition has proven very successful, achieving an average 93.2% classification rate for neutral, happy, surprised, angry, disgusted, afraid and sad [4]. A lot of that work relies only on a cheap web camera, and databases exist that provide many images of different people expressing different emotions [3]. Speech analysis also shows good performance in emotion recognition, achieving 80.60% when classifying boredom, neutral, anger, fear, happiness, sadness and disgust from the Berlin Emotion Database. "The average classification accuracy of speaker-independent speech emotion recognition system is less then 80% in most of the proposed techniques. For speaker dependent classification, the recognition accuracy exceed 90% only in few studies" [15, p. 584]. While there are studies on which human facial features distinguish best between emotions, there is no clear feature definition for speech.
Cameras have become more accessible and easier to use in everyday life, especially in the entertainment area, so the focus has also shifted to analysing human body movements and postures. Many attempts have been made to machine-classify emotions from body movements and postures.
Most of the research on body movements is based on acted emotions, where actors are asked to play different emotions. To my knowledge, only a few attempts are based on real human emotions. Table 2.2 summarises automatic solutions for emotion recognition using human body movements and postures. These works are used as inspiration for, and/or classification comparison with, this work.
Kapur (2005)
Emotions: sadness, joy, anger, fear. Scenario: acted. Body position: standing, full body. Information extracted: velocity and position of body joints. Accuracy: up to 92%. Sensor: VICON motion capture system.

Kleinsmith, Bianchi-Berthouze (2007)
Emotions: defeat, frustration, triumph, concentration. Scenario: real. Body position: standing, full body, static postures. Information extracted: basic posture cues related to the distance of adjacent joints. Accuracy: 60%. Sensor: Gypsy 5 motion capture system.

Glowinsky (2011)
Emotions: elation, amusement, pride, hot anger, fear, despair, pleasure, relief, interest, cold anger, anxiety, sadness. Scenario: acted. Body position: standing, only head, right and left hand. Information extracted: body posture represented through changes in body extension, arm and upper body position. Accuracy: 96%. Sensor: consumer video camera (25 fps).

Sanghvi (2011)
Emotions: engagement. Scenario: real. Body position: children playing chess with an iCat robot, upper body. Information extracted: meta-features related to movement and posture cues. Accuracy: 82.2%. Sensor: video camera.

Garber-Barron, Si (2012)
Emotions: triumphant, concentrated, defeated, frustrated. Scenario: real. Body position: standing, full body. Information extracted: meta-features, joint rotation angles. Accuracy: 66.5%. Sensor: Gypsy 5 motion capture system.

D'Mello (2009)
Emotions: boredom, confusion, delight, flow, frustration, neutral. Scenario: real. Body position: sitting position, interactions with an intelligent tutoring system. Information extracted: average pressure, spatial and temporal properties of pressure. Accuracy: 71, 55, 46, 40%. Sensor: body pressure measurement system.

Table 2.2: List of related works.
3 Prerequisites
In this chapter the prerequisites for this work are explained. It starts with a description of the Microsoft Kinect for Windows and the Microsoft Kinect Software Development Kit 1.6, which are the central components of this work. The next section explains the methods that are usually used in sequence classification. The last section describes the functionality and possibilities of Self-Organising Maps.
3.1 Microsoft Kinect For Windows
3.1.1 Kinect sensor
The development of the Kinect sensor was first announced on June 1, 2009 under the name Project Natal. The device was launched in November 2010 as a part of the Xbox platform. A lot of work was put into reverse engineering the device, making it available for use with platforms other than the Xbox. The first Microsoft Kinect Software Development Kit was released on June 16, 2011; this work uses version 1.6 (in the following called SDK 1.6). It allowed the Kinect to be used with a regular PC. Other drivers that allowed the same were also developed, for example OpenNI. The first commercial version of the Kinect was launched in early 2012 under the name Kinect for Windows. A new SDK from Microsoft was also released, with improved skeleton tracking and more possibilities (two depth ranges, seated mode etc.). The new SDK is not compatible with the old Kinect Xbox 360 device, but there are no significant hardware differences between the two versions.
The Kinect hardware consists of two cameras (IR and colour), an IR emitter, a microphone and a status LED, shown in Figure 3.1. The Kinect emits a known light pattern and is able to infer depth by observing its deformation in the scene [25, 7].
Figure 3.1: Kinect hardware, inspired by [25].
3.1.2 Microsoft Kinect SDK 1.6
The SDK 1.6 provides human skeleton tracking out of the box. The Kinect can recognise up to six people in the field of view of the sensor, but can only track two simultaneously. Two tracking modes are available: near mode and default mode. In near mode the Kinect can track people standing between 0.4 and 3 meters away, in default mode the range is between 0.8 and 4 meters.
Skeleton tracking tracks 20 human joints and provides their position and orientation, with the reference point placed at the Kinect sensor. The position data is expressed in Projective Coordinates and World Coordinates. The Projective Coordinates consist of a 3D vector {x, y, z}, where x and y are pixel values and z is the real-world distance expressed in millimetres. The World Coordinates are the projections of the x and y values into Euclidean space: x and y are expressed in meters, the z value is in millimetres.
The joint orientation is provided in two ways, Absolute Orientation and Hierarchical Rotation, expressed in the form of quaternions and rotation matrices. The Absolute Orientation gives the joint rotation information in Kinect camera coordinates. The Hierarchical Rotation provides information about how a child joint is rotated relative to its parent. The hierarchy of joints is shown in Figure 3.3, where the HipCenter is the root joint [26].
3.1.3 Real-Time Human Body Joint Acquisition
The SDK 1.6 skeleton tracking is closed source, so there is no concrete information available about how it is implemented. In 2011 Shotton et al. [37] published a paper that explains how the skeleton joint information is computed, but since then Microsoft has released several versions of the SDK with improvements to the skeleton tracking. My conclusion is that this publication does not describe exactly how the skeleton tracking is implemented in SDK 1.6, but it definitely served as an inspiration or base for it. Below is a summary of the skeleton tracking implementation, based on [37].
The 3D positions of the body joints are extracted from a single depth image: for each depth image, a classification task is performed for each pixel.
Figure 3.2: Joint positions and abbreviations.
Figure 3.3: Joint hierarchy, inspired by [26].
First the background is removed from the depth image, leaving only the human pixels. Then the depth image features are extracted for each of the remaining pixels using equation 3.1, where a depth feature is the difference of depth at two pixels. An example is shown in Figure 3.4.

    f_θ(I, x) = d_I(x + u / d_I(x)) - d_I(x + v / d_I(x))    (3.1)

where d_I(x) is the depth at pixel x in image I and θ = (u, v) describes the offsets u and v to the two probe pixels.
Figure 3.4: Depth image features, taken from [37]. The yellow crosses indicate the pixels being classified. The red circles indicate the offset pixels. In (a), the two example features give a large depth difference response. In (b), the same two features at new image locations give a much smaller response.
A random decision forest is used to classify the pixels. A large amount of synthetic and real depth image data was used for training, making the estimate of the joint positions more accurate.
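For illustration, the following is a minimal sketch of the depth-comparison feature from equation 3.1, assuming the depth image is available as a NumPy array in millimetres; the function name, the border handling and the background handling are assumptions, not taken from [37] or the SDK.

    import numpy as np

    def depth_feature(depth, x, theta):
        # Depth-comparison feature f_theta(I, x) from equation 3.1.
        # depth : 2D array of depth values in millimetres (0 = background)
        # x     : (row, col) of the pixel being classified
        # theta : (u, v), two pixel offsets
        u, v = theta
        d_x = float(depth[x])
        if d_x == 0:                       # background pixel, feature undefined
            return 0.0

        def probe(offset):
            # offsets are scaled by 1/d_I(x) to make the feature roughly depth invariant
            r = int(round(x[0] + offset[0] / d_x))
            c = int(round(x[1] + offset[1] / d_x))
            inside = 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]
            if inside and depth[r, c] > 0:
                return float(depth[r, c])
            return 1e6                     # large constant for background / out-of-image probes

        return probe(u) - probe(v)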
3.2 Sequence Classification
A sequence is an ordered set of elements, for example a time series or a DNA sequence. The problem with sequence classification is that most classifiers (Neural Networks, decision trees, Support Vector Machines (in the following called SVM)) take as input a fixed-size vector of single-instance values.
A sequence classification survey by Xing, Pei and Keogh [42] distinguishes three categories of sequence classification methods. They are presented below.
Feature Based Classification
To transform a multi-instance problem (at each time t there is a set of values) into a single-instance problem, one could aggregate the input into bags by computing the mean, maximum and other features — values that summarize each instance. But it is a difficult task to find the values, sometimes named features, that summarize the instances. For time series, a proposed feature is time series shapelets — "the time series subsequence, which can maximally represent a class" [42].
Sequence Distance Based Classification
"Sequence distance based methods define a distance function to measure the similarity between a pair of sequences" [42]. The Euclidean distance or Dynamic Time Warping (in the following called DTW) can be used as the distance measure. Afterwards, different classification methods can be used, for example SVM. The problem with using SVM in sequence classification is choosing the correct kernel function. The kernels usually used for sequence classification are polynomial-like kernels, kernels derived from probabilistic models, and diffusion kernels.
Model Based Classification
Model based classification "...is based on generative models, which assume sequences in a class are generated by an underlying model M. Given a class of sequences, M models the probability distribution of the sequences in the class. Usually a model is defined based on some assumptions, and the probability distributions are described by a set of parameters. In the training step, the parameters of M are learned. In the classification step, a new sequence is assigned to the class with the highest likelihood" [42]. Classifiers such as Naive Bayes and Hidden Markov Models are usually used.
3.2.1 Dynamic Time Warping
"Dynamic time warping (DTW) is a well-known technique to find an optimal alignment between two given (time-dependent) sequences under certain restrictions. Intuitively, the sequences are warped in a non-linear fashion to match each other. Originally, DTW has been used to compare different speech patterns in automatic speech recognition" [27].
DTW compares two time-dependent sequences by finding a local cost measure (distance measure) between the sequences' elements. The cost is small if the elements are similar and high otherwise. After computing the local cost measure per element pair, an overall cost matrix is calculated. The alignment between the two sequences is the minimal-cost path through the overall cost matrix, referred to as the warping path.
The optimal warping path should satisfy three conditions:
Boundary condition
The first and last elements of each sequence should be aligned to each other.
Monotonicity condition
The element order should remain the same (the path should not go down or to the left).
Step size condition
No elements should be omitted.
Searching through all possible warping paths for the optimal one is a costly operation, so different techniques are used to speed up this process. One technique is global constraints, which restrict the optimal path search. Two widely used global constraint regions are the Sakoe-Chiba band and the Itakura parallelogram; an example of the two regions is shown in Figure 3.5, where the warping path can only be selected from the gray region. More detailed information about DTW can be found in [27].
Figure 3.5: Global constraints (Sakoe-Chiba band and Itakura parallelogram), taken from [27].
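As a concrete illustration of the cost-matrix computation and the Sakoe-Chiba band, below is a small sketch of DTW with a Euclidean local cost. It is a generic textbook formulation, not the exact implementation used later in this work, and all names are illustrative.

    import numpy as np

    def dtw_distance(a, b, band=None):
        # a, b : sequences of feature vectors, shapes (n, d) and (m, d)
        # band : half-width of the Sakoe-Chiba band in frames (None = unconstrained)
        a, b = np.asarray(a, float), np.asarray(b, float)
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)    # accumulated cost matrix
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            lo, hi = 1, m
            if band is not None:               # restrict the search to the band
                lo, hi = max(1, i - band), min(m, i + band)
            for j in range(lo, hi + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local cost: Euclidean distance
                D[i, j] = cost + min(D[i - 1, j],      # insertion
                                     D[i, j - 1],      # deletion
                                     D[i - 1, j - 1])  # match
        return D[n, m]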
3.2.2 Support Vector Machines
SVM is a supervised learning algorithm for two-group classification problems. The idea of SVM is to non-linearly map the input vectors into a high-dimensional feature space and to construct there a linear decision surface that separates the input vector classes. The range of SVM applications is huge; for example, they are commonly used in sequence classification [42]. The rest of this section gives an introduction to some of the necessary details related to SVM. It is inspired by [41, 40].
Linearly separable classes
In the following we explain the SVM functionality using an example. Assume that we have linearly separable data, shown in Figure 3.6, consisting of 2 classes, red and purple; we can define a line which separates the two classes. The problem is that we can define many such lines, so the question is which one of them to choose.
Figure 3.6: An example of problem definition. Red colour represents the first class, purple the second.
To answer that question, for each of the lines we can define an area where the classification of the two classes will go wrong. The distance from a plane (in this example a line) to the closest point is called the margin. Two margins of different size are shown in yellow in Figure 3.7. The problem is now how to get the best possible margin. There is an intuitive and mathematical argument that the biggest of all possible margins is the best margin.
Figure 3.7: Different size margin example.
To find the separating plane (or hyperplane), we need to find the maximum possible margin. If we denote the plane by w and the closest point to that plane by x_n, then the distance from x_n to the plane w can be computed by equation 3.2.

    distance = 1 / ||w||    (3.2)

The goal is to find the maximum margin, which means we have an optimization problem that can be defined as follows:

    Maximise 1 / ||w|| subject to min_{n=1,2,...,N} |w^T x_n + b| = 1,

where b is the bias.
A minimum inside the constraint is inconvenient for an optimization problem, so the problem can be rewritten as follows:

    Minimise (1/2) w^T w subject to y_n (w^T x_n + b) ≥ 1 for n = 1, 2, ..., N,

where y_n is the class label and we assume that all points are classified correctly.
The solution of this constrained optimization problem can be found using Lagrange multipliers and quadratic programming. The result of the quadratic programming step is the set of Lagrange multipliers α. If α_n > 0, then the corresponding point is a support vector (SV). The support vectors support the separating hyperplane, so having found them one can compute the separating hyperplane using equation 3.3. An example of a separating hyperplane with its support vectors is shown in Figure 3.8.

    w = Σ_{x_n is SV} α_n y_n x_n    (3.3)
Figure 3.8: A maximum-margin hyperplane, taken from [41].
Non-linearly separable classes
SVM can also be used for non-linearly separable problems. The main idea is that if a problem is not linearly separable in the input space X, one can non-linearly transform it into a high-dimensional space Z where it is linearly separable. One does not need to know what the space Z is; it is enough to know that Z exists, because only the inner product in Z is needed to compute the separating hyperplane.
For illustration, assume we have two points x and y, x, y ∈ X, and let z and z' be their images in Z. The needed inner product is z^T z' = K(x, y), where K(x, y) is called the kernel. Once one has the kernel, one can use the same computation as in the linearly separable case. There are two requirements for a kernel to be valid: the kernel needs to be symmetric, and Mercer's condition must be satisfied.
The typically used kernels are the polynomial kernel and the radial basis function (in the following called RBF) kernel. The polynomial kernel is shown in equation 3.4, where x and y are vectors in the input space. The RBF kernel is shown in equation 3.5.

    K(x, y) = (x · y)^d    (3.4)

    K(x, y) = exp(-γ ||x - y||^2)    (3.5)

Two class separation problems are shown in Figure 3.9. The kernel deals with problem (b), but when there are only outliers, as in (a), there is no need to solve a complex non-separable case; in real-world problems one usually has a combination of both (a) and (b). The solution to the outlier problem (a) is to define a soft margin (model complexity). Two parameters define the model complexity, C and γ: γ controls the width of the margin violation zone, while C is the penalty for margin violations.
Figure 3.9: Class separation problems.
Multi-class Support Vector Machines
SVM normally deals with binary classification problems. The common approach to convert a binary classifier into a multi-class classifier is to reduce the multi-class problem to multiple binary classification problems: one can build classifiers that distinguish one class versus all others, or one classifier for every pair of classes (one versus one).
Classification in the one-versus-all case is done by having functions that produce comparable scores for each class; the highest score selects the class, as in the winner-takes-all strategy. In the one-versus-one case each classifier assigns a class, and the class with the most votes from the classifiers becomes the resulting class.
In this work I am using the SVM implementation of the data mining software Weka [19], which works as described in this section, apart from further optimizations that are described in [31].
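To make the parameters discussed above concrete, here is a small sketch using scikit-learn's SVC purely as a stand-in for Weka's SMO; the parameter values are placeholders, and this is not the implementation used in this work.

    from sklearn.svm import SVC

    # Illustrative only: an RBF-kernel SVM with a one-versus-one multi-class
    # scheme, exposing the two model-complexity parameters C and gamma.
    clf = SVC(kernel="rbf", C=1.0, gamma=0.01, decision_function_shape="ovo")

    # X_train: one fixed-length feature vector per emotion sequence
    # y_train: the corresponding emotion labels
    # clf.fit(X_train, y_train)
    # predictions = clf.predict(X_test)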
3.3 Data Visualization: Self-Organising Map
A Self-Organising Map (in the following called SOM) is a type of neural network that is trained using unsupervised learning. It is used not only for clustering data, but also for visualizing multidimensional data [43].
A SOM is a set of connected nodes (neurons) organised in a topology, usually rectangular or hexagonal. It is based on competitive learning, in which the output neurons compete amongst themselves to be activated, with the result that only one is activated at any one time. This activated neuron is called the winning neuron or Best-Matching Unit (in the following called BMU). Each winning neuron corresponds to one or more input vectors.
Two different SOM training algorithms can be used: sequential training and batch training. In sequential training the SOM is trained iteratively. First, a random vector from the input space is chosen and the distance between it and all the neurons in the topology is calculated using the Euclidean distance (other distance measures are also possible). The neuron with the smallest distance is the BMU. In batch training, instead of taking one input vector at a time, all input vectors are used before any adjustments to the map are made.
One way of visualizing the SOM result is the unified distance matrix (in the following called u-matrix). The u-matrix represents the relationships between neighbouring neurons. The quality of a SOM can be evaluated with the quantization error E_QE and the topographic error E_TE. The quantization error, shown in equation 3.6, is the average distance between the input vectors X_p and their winning neurons M_c(p). The topographic error E_TE, shown in equation 3.7, shows how well the trained network preserves the topography of the analysed data [5, 43].

    E_QE = (1/m) Σ_{p=1}^{m} || X_p - M_c(p) ||    (3.6)

    E_TE = (1/m) Σ_{p=1}^{m} u(X_p)    (3.7)
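As a small illustration of equation 3.6, the sketch below computes the quantization error for a trained map stored as an array of neuron weight vectors; the array layout and names are assumptions and are not tied to any particular SOM library.

    import numpy as np

    def quantization_error(inputs, neurons):
        # inputs  : array of shape (m, d) with the input vectors X_p
        # neurons : array of shape (k, d) with the neuron weight vectors
        inputs, neurons = np.asarray(inputs, float), np.asarray(neurons, float)
        # distance from every input to every neuron, shape (m, k)
        dists = np.linalg.norm(inputs[:, None, :] - neurons[None, :, :], axis=2)
        return float(dists.min(axis=1).mean())   # mean distance to the BMU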
4 System Description
Figure 4.1: The process sequence: emotion database design and implementation, data collection, data analysis (hand position improvement, sequence segmentation, sequence labelling, joint data extraction), sequence length equalization, meta-feature extraction, classification and result evaluation. Rhombus represents junction.
This chapter explains the whole system, as shown in Figure 4.1. First the emotional database was designed, then the system for data acquisition was implemented. The next step was data analysis: improvement, segmentation and labelling. From the labelled sequences the Projective Coordinates, World Coordinates and Hierarchical Rotation (in the following called HR) of the 20 joints were extracted and saved for further processing. Then the data was preprocessed, which included equalizing the length of the emotional videos and meta-feature extraction. The system ended with data classification and evaluation. In the rest of the chapter each step is described in more detail.
4.1 Data Acquisition
This thesis' goal is real human emotion recognition. To achieve it, an emotion dataset is needed. The following sections explain the process of acquiring the emotion data.
4.1.1 Emotion Database
Figure 4.2: Project set-up, parts taken from [1].
To my knowledge there is no existing emotion database that contains human emotions recorded using the Microsoft Kinect for Windows (in the following called Kinect). To acquire such data, a special recording system was developed through a series of experiments and different set-ups.
The emotion database design criteria are explained below:
Real-world emotions or acted emotions
Only a little research concerning human emotion expression through body movement and posture is based on real emotion recordings; other works use actors. Studies [35] show that acted emotions tend to be exaggerated. Because there is so little research based on real human emotions, we needed to record our own data.
Participants
13 students aged 20 to 29 participated in the recordings: 4 females and 9 males.
Chosen emotions
As stated earlier, the chosen emotions were anger, joy, sadness, fear, disgust and surprise.
The recording
One participant at a time was recorded, alone in the room, with the full human body visible. The project set-up is shown in Figure 4.2; for showing the videos/game a 32 inch flat screen TV is used with the Kinect attached on top. The participants' starting position was around 2 m away from the TV, and they were allowed to move in different directions. The TV is placed 1.7 m above the floor; the height was chosen based on the participants' overall height and the Kinect positioning recommendations.
How to stimulate responses?
"...it is believed that most emotions are outcomes of our response to different situations" [15, p. 573]. We used different visual stimuli to create different emotional responses. In order to acquire the six emotions, several tests with different people were conducted beforehand. The videos were chosen empirically: first, a video was tested on two or three subjects, their response was recorded, and afterwards they were asked how they felt. Based on their answers and reactions, the video was removed from the set or kept. Performing this operation multiple times, a set of 13 different videos was collected, with lengths varying from 0:58 to 3:36 minutes. The input videos were taken from YouTube (attached on the DVD).
After discussion with the test participants, a set of recording system requirements was prepared:
• The whole recording process should not take more than 15 minutes, because people get tired of standing.
• One video should not exceed 4 minutes, because people get bored if the specific video is not of interest to them.
• Negative videos should be short, around 1-2 minutes, otherwise people will be likely to stop participating.
• There should be a variety of different videos to keep the person interested.
One does not always get the same response to the same video from different people. Experiments showed that an emotion such as anger or surprise is very hard to trigger by displaying a short video without knowing the person. A special voice-controlled game was developed to stimulate anger. The idea of the game was to irritate the participant, which was achieved by sometimes ignoring the user's input.
4.1.2 Recording System
Previous experiments were conducted to test different skeleton tracking libraries [24]. The experiments showed that the best of the libraries available as of this writing is the Kinect for Windows Software Development Kit 1.6 (in the following called SDK 1.6).
In this project the Kinect for Windows device with SDK 1.6 was used. SDK 1.6 has a skeleton tracking method implemented and provides, out of the box, the position and orientation of 20 human skeleton joints (see Figure 3.2).
SDK 1.6 provides a special piece of software called Kinect Studio for recording and playback of the colour, depth and skeleton streams. It is easy to use, but it has limitations. It emulates a Kinect, so that you have a virtual Kinect, meaning that Kinect Studio and your program have to run simultaneously. You can not:
• play back the recorded files if you do not have a Kinect plugged in
• stop and start the Kinect Studio recording from your own software
• control the playback speed of Kinect Studio, which makes it harder to save the individually recorded frames if your computer is not powerful enough
To create a system that shows different videos to a participant and plays a game with him, it had to be possible to stop and start the recording at any given time. Because this was a crucial feature, another recording system, Kinect Toolbox 1.2 developed by David Catuhe, was used [8]. It was modified and improved because of stream synchronization issues and recording failures.
A recording system was implemented that shows a number of randomly chosen videos (the amount depended on their length) and at the end plays the turtle game. The SDK 1.6 SpeechBasics-WPF project was used as an inspiration for the turtle game. In order for a participant to feel more comfortable, the system was made so that the participant starts it by himself, using his hand to control the mouse; doing it this way also ensures that the skeleton is fully detected. The initial screen of the recording system is shown in Figure 4.3. The videos were shown one after another; the recording was turned on when each video started and was turned off and saved when the video ended. The same is true for the game.
A screenshot of the turtle game is shown in Figure 4.4. The purpose of the game is for the turtle to reach its goal (get food); the player says where the turtle should go. Controlling the turtle's movement by voice commands also keeps the participant's movement to a minimum.
Figure 4.3: Recording system initial window. The coloured hand position is controlled by the participant's right hand. The frame in the bottom right shows the Kinect depth image with the recognised person. The recording system starts when the coloured hand is on top of the red button for a short period of time.
The maximum time to reach the destination is set to 2 minutes. The player is allowed to stop the game at any given time by saying Stop. When the player says a direction for the first time, the turtle moves in that direction. The next time the player says a direction (Forward, Left or Right), the direction in which the turtle goes is chosen randomly, with a 30% chance of choosing the given direction, making the turtle move in the wrong direction most of the time. The direction Back is always obeyed. When the time expires or the destination is reached, a text identifying the end of the game is shown, which concludes the recording. When the turtle is close to the destination and then moves away from the goal, special texts appear to indicate the turtle's feelings. This was implemented in order to make the participant more annoyed, angry or frustrated.
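The direction logic described above can be summarised by the following sketch; the function name and the way the wrong direction is sampled are assumptions, only the 30% probability and the special handling of Back and of the first command come from the text.

    import random

    def next_direction(requested, is_first_command):
        # "Back" is always obeyed, and so is the very first command.
        if requested == "Back" or is_first_command:
            return requested
        # Afterwards the requested direction is followed with only 30% probability;
        # otherwise one of the other directions is picked (one possible reading).
        if random.random() < 0.3:
            return requested
        return random.choice([d for d in ("Forward", "Left", "Right") if d != requested])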
4.1.3 Skeleton Improvement
Previous experiments [24] showed that the hand, wrist, ankle and foot position data acquired from SDK 1.6 are not stable, with hand and foot being more unstable than wrist and ankle. This project's focus is on emotion classification, so only quick improvements were implemented. It is hard to improve the ankle and foot data; since the foot data is much more distorted than the ankle data, the foot was eliminated from further processing. In order to improve the hand position, participants were asked to wear clothes with long sleeves that cover the arm and expose only the skin of the hands, so that skin tracking could be used to improve the hand position. The wrist joint was therefore also eliminated from further processing.
Figure 4.4: Turtle game screenshot. When the participant says a direction, the direction the turtle moves in is highlighted in blue.
Of the 20 provided joints, 16 were used.
To improve the hand position, the video sequences were first processed using the SDK 1.6 built-in background removal, which leaves only the player's (participant's) pixels plus some noise.
A program was implemented to process these images. The processing flow is shown in Figure 4.5. First, each frame is converted to the YCrCb or HSV colour space; the choice of colour space was based on the participant's clothes. Then, using an empirically found threshold, the frame is converted to a binary image. For the first frame we assume that the wrist and hand positions provided by the skeleton are correct, so the contour containing the hand is assigned. For the following frames the contour closest to the previous contour is found and assigned. When searching for the closest contour, contours that contain the head or the opposite hand point are omitted. The new hand position is the centre point of the contour's bounding rectangle. If no contour is found, the coordinate from the extracted skeleton is used as the hand position. Some examples are shown in Figure 4.6.
This implementation has the following limitations:
• the new hand position is not found if the hands touch each other or the head
• when the two hands are close to each other, there is a high chance that the closest contours will be swapped
• it does not work when no skin is visible, or when too much arm skin is visible
• the method needs manual supervision
Figure 4.5: Hand position improving sequence diagram: convert to YCrCb/HSV, convert to a binary image, find contours, assign a contour to each hand (first frame), find the nearest contour (following frames), assign the value to the joint. Rhombus represents junction.
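For illustration, a rough sketch of one step of the hand refinement described above, using OpenCV; the skin threshold values, the contour selection rule and the function interface are simplifications and assumptions, not the actual implementation (the OpenCV 4.x findContours signature is assumed).

    import cv2
    import numpy as np

    def refine_hand(frame_bgr, prev_hand_xy,
                    skin_lo=(0, 133, 77), skin_hi=(255, 173, 127)):
        # Convert to YCrCb and threshold to a binary skin mask (generic skin
        # range; not the empirically found thresholds used in this work).
        ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
        mask = cv2.inRange(ycrcb, np.array(skin_lo, np.uint8), np.array(skin_hi, np.uint8))
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return prev_hand_xy            # fall back to the skeleton-provided position

        def centre(contour):
            x, y, w, h = cv2.boundingRect(contour)
            return np.array([x + w / 2.0, y + h / 2.0])

        # Pick the contour whose bounding-box centre is closest to the previous hand.
        best = min(contours, key=lambda c: np.linalg.norm(centre(c) - prev_hand_xy))
        return centre(best)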
4.1.4 Time Series Segmentation
The participants saw videos of different lengths, and they expressed different emotions throughout a video; for example, at the beginning a participant might show boredom and later joy. Because of this, the recorded videos need to be segmented to get one (or multiple) emotions per sequence.
From the recorded sequences the skeleton's 16 joint HR angles for each frame were saved into an XML file. The HR angles are given in quaternion format; they were transformed into roll, pitch and yaw values.
The hypothesis is that if a human is observing some video input and suddenly starts to move, then it is very likely that it is the visual input he sees that makes him move. Taking this into consideration, the idea is that the participant's sudden movements are possibly an emotion being expressed. To find these sudden movements, the change in the joint rotations from frame to frame was calculated.
The changes were calculated as the Euclidean distance between the skeleton HR of frame x + 3 and that of frame x, for example from frame 3 to frame 0, then from frame 6 to frame 3, etc. The pseudo-code is shown in Algorithm 1. The number 3 was chosen empirically by comparing the result to the actual video. The changes were calculated for each joint and a graph was plotted; an example is shown in Figure 4.7.
Figure 4.6: Hand position improvement examples. HandLeft - green, HandRight - yellow, HandLeft improved - cyan, HandRight improved - red. (a) and (b) show successful hand improvement. (c) shows unsuccessful hand improvement, because the left hand is not visible. (d) shows the thresholded image of (c).
Algorithm 1 Joint HR changes calculation
for i = currentFrame to amountOfFrames do
    i ← i + 3
    amountOfChange = sqrt((roll_i - roll_prev)^2 + (pitch_i - pitch_prev)^2 + (yaw_i - yaw_prev)^2)
    prev ← i
end for
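A compact NumPy version of the same computation is sketched below, assuming the per-frame (roll, pitch, yaw) values of a single joint are already available as an array; the array names are illustrative.

    import numpy as np

    def joint_changes(angles, step=3):
        # angles : array of shape (n_frames, 3) with (roll, pitch, yaw) for one joint
        # Returns the Euclidean distance between frame i and frame i - step,
        # for i = step, 2*step, ..., exactly as in Algorithm 1.
        angles = np.asarray(angles, dtype=float)
        return np.linalg.norm(angles[step::step] - angles[:-step:step], axis=1)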
To obtain the sequences in which there are significant changes in the joint HR angles, the sequences were cut using a threshold. The cutting threshold was also chosen empirically; its value is 0.02. The sequences are cut at the threshold crossings with a margin of 10 frames, which was also found empirically. The program outputs the ranges for each joint. These are then combined manually by finding the smallest and the highest frame number over all joints per graph peak. The cut video pieces are then combined into video sequences. This could be automated by extending the software to look for overlapping results.
4.1.5 Sequence Labelling
Emotion sequences need to be labelled in order to classify them. In this experiment it was not possible to get the ground truth of what the observed person was feeling at a specific moment, so the labelling was performed manually. After processing the participants' responses, a set of 304 video fragments was created. These fragments were manually labelled by a group of four people.
Figure 4.7: Possible emotion cut example. The check boxes at the bottom of the image allow selecting a joint. Skip, in the bottom right of the window, corresponds to the number of skipped frames. The Refresh button, together with the specified Start and End frame numbers, allows inspecting the region of interest. The joint changes are plotted in the graph, each joint in a different colour. PE indicates a possible emotion.
The group was not told the total number of emotions, nor which emotions to look for. What the participant was watching in each video was revealed to the group only after they had classified the videos. With that knowledge, some of the fragments labelled as bored became disgust. 250 videos were classified as containing emotions. The five classes containing the most video fragments were kept. These five classes differ from the ones initially expected; they are curiosity, confusion, joy, boredom and disgust. The emotion boredom contained much more data than the other emotions; to make its amount equal to the rest of the data, some of the bored fragments, chosen randomly from each participant, were removed. The total amount of emotion videos is shown in Figure 4.8.
Figure 4.8: Collected data. For each emotion (Curious, Joy, Confusion, Boredom, Disgust) the dark colour shows the number of emotion videos and the light colour shows the number of people expressing the emotion.
4.2 Data Preprocessing
4.2.1 Equalizing the Length of the Emotion Videos
Support Vector Machines (in the following called SVM) were selected as the classification method, and the implementation used cannot deal with data of different lengths. Therefore, before sending the data to the SVM, the data lengths need to be equalized.
In order to give all video sequences the same length, two different techniques were used. One method that is commonly used in sequence recognition is Dynamic Time Warping (in the following called DTW). The other technique is Simple Video Stretching (in the following called SVS), which stretches each sequence to the target number of frames by repeating frames.
The performance of DTW and SVS will be shown on a small example. The example contains two data series of different lengths, Series A and Series B, as shown in Table 4.1.

Frame  1   2   3   4   5   6   7   8   9   10
x      1   1   2   3   4   5   6   7   8   9
y      0   0   1   1   1   1   0   0   1   0
z     -2  -2  -2  -2  -2  -2  -3  -3  -2  -4
(a) Series A

Frame  1   2   3   4   5
x      6   6   7   7   8
y      1   0   0   1   0
z     -2  -4  -3  -2  -4
(b) Series B

Table 4.1: Example input.
The DTW is performed with no constraints and using the Euclidean distance; the resulting cost matrix is shown in Figure 4.9. The DTW and SVS end results are shown in Figure 4.10.
For DTW, each dimension of the data was used in the path computation with the same weight. For example, if DTW is performed on joint data, where each frame consists of 16 joints and each joint has a roll, pitch and yaw value, that forms a 16 x 3 = 48 dimensional space, and each dimension is part of the DTW computation. When performing DTW on the emotion dataset, every sequence is compared with the longest sequence.
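As an illustration, one plausible reading of SVS is sketched below: a sequence is stretched to a target number of frames by repeating frames via nearest-neighbour resampling. The resampling rule and the names are assumptions; the actual implementation may differ.

    import numpy as np

    def stretch(sequence, target_len):
        # sequence : array of shape (n_frames, n_features)
        sequence = np.asarray(sequence)
        idx = np.round(np.linspace(0, len(sequence) - 1, target_len)).astype(int)
        return sequence[idx]               # frames repeated to reach target_len

    # e.g. stretching Series B (5 frames) to the length of Series A (10 frames)
    series_b = np.array([[6, 1, -2], [6, 0, -4], [7, 0, -3], [7, 1, -2], [8, 0, -4]])
    stretched = stretch(series_b, 10)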
4.2.2 Meta-Feature Extraction
Different publications [17, 16, 18, 10, 33] show that using meta-features (recurring substructures [21]) is better than using raw joint rotation data for emotion classification. The work with the highest success rate on real human emotions is by Garber-Barron and Si [16]. That work is therefore used as inspiration, and most of their proposed meta-features are reimplemented or modified in this work.
Figure 4.9: DTW result cost matrix. Cost matrix of the two sequences, Series A and Series B, using the Euclidean distance as the local cost measure. Regions of low cost are indicated by dark colours and regions of high cost by light colours. The optimal warping path is shown as a red line.
Figure 4.10: DTW and SVS result, showing Series A after DTW/SVS, Series B after DTW and Series B after SVS (sequence frame number versus output frame number).
Feature Definition & Explanation
Pose Difference
The algorithm is shown in Algorithm 2 (Appendix A); it calculates the Euclidean distance between the left and right parts of the body and returns the mean.
Pose Symmetry
"Estimated the asymmetries caused by the misalignment of joints" [16]. The algorithm is shown in Algorithm 3 (Appendix A).
Directed Symmetry
Calculates the Pose Symmetry together with the direction of the asymmetry. The algorithm is shown in Algorithm 4 (Appendix A).
Head Offset Alignment
Estimates the relationships between head and chest, and between head and hips. The algorithm is shown in Algorithm 5 (Appendix A); it returns three values — HeadOffset, HeadAlignment and HeadChestRatio. HeadOffset is the Euclidean distance between the head location and the hip middle point. HeadAlignment is the Euclidean distance between the head rotation and the average hip rotation. HeadChestRatio represents the relationship between the head rotation and the chest rotation.
Leg Hip Openness
Represents the openness of the lower part of the body by computing the ratio between the hip-ankle distance and the knee distance. The algorithm is shown in Algorithm 6 (Appendix A).
Average Rate Of Change
Estimates the speed of change of a feature over a specified time interval (window). The feature can be a joint, Pose Difference, etc. The algorithm is shown in Algorithm 9 (Appendix A).
Relative Movement
Represents the amount of movement of a feature over a period of time (window) compared to the entire sequence. The algorithm is shown in Algorithm 10 (Appendix A).
Smooth-Jerk
"Represent relative variance of the feature over time. Sudden changes generate bigger values, less rapid changes - smaller" [16]. The algorithm is shown in Algorithm 11 (Appendix A).
Body Lean
Body Lean is a feature that is often used in human interest recognition [10, 22]; in this work, observation also showed that such a feature could be useful for the emotion curiosity. Here, Body Lean represents the difference between the average hip and shoulder positions and shows the direction and amount of movement along the z axis. The algorithm is shown in Algorithm 8 (Appendix A).
Body Openness
Garber-Barron and Si [16] estimated the lower body openness (see Leg Hip Openness), so a feature that computes the upper body openness was introduced. It represents the amount of upper body openness. Body Openness is calculated using Projective Coordinates, by evaluating the alignment of the shoulder, elbow and hand. The algorithm is shown in Algorithm 7 (Appendix A).
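As an illustration of the kind of computation involved, below is a rough sketch of two of the simpler features (Pose Difference and Body Lean) as they are described above. The exact definitions used in this work are the ones in Appendix A; the joint layout and averaging details here are assumptions.

    import numpy as np

    def pose_difference(left_joints, right_joints):
        # Mean Euclidean distance between corresponding left/right joint angles.
        # left_joints, right_joints : arrays of shape (n_joints, 3) holding
        # (roll, pitch, yaw) per joint for one frame.
        dists = np.linalg.norm(np.asarray(left_joints) - np.asarray(right_joints), axis=1)
        return float(dists.mean())

    def body_lean(shoulder_positions, hip_positions):
        # Difference between the average shoulder and hip positions along the z axis.
        return float(np.mean([p[2] for p in shoulder_positions]) -
                     np.mean([p[2] for p in hip_positions]))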
Garber-Barron and Si divided the meta-features into three different groups to test their performance. The groups and their names are taken from [16].
The first group is named the Posture Group, shown in Table 4.2. It contains 10 different features proposed by Garber-Barron and Si [16] plus 2 features, Body Lean and Body Openness, proposed in this work. Each element of the Posture Group is computed for each frame in a sequence.
Short Name | Feature Name | Parameters
P0 | Pose Difference | Left Arm HR, Right Arm HR
P1 | Pose Difference | Left Leg HR, Right Leg HR, Head HR
P2 | Pose Symmetry | Left Arm HR, Right Arm HR, Head HR
P3 | Directed Symmetry | Left Arm HR, Right Arm HR, Head HR
P4 | Pose Symmetry | Left Leg WL, Right Leg WL, Hip WL
P5 | Directed Symmetry | Left Leg WL, Right Leg WL, Head WL
H1 | Head Offset |
H2 | Head Alignment |
H3 | Head Chest Ratio |
L1 | Leg Hip Openness |
BL | Body Lean |
OP | Body Openness |

Table 4.2: Posture Group definition. HR is short for Hierarchical Rotation and WL is short for World Coordinates.
The second group is named Limb Rotation Movement Group and contains 18 features that are computed for each frame. For each feature, the inputs are combined joint data; the possible combinations and their names are shown in Figure 4.11. Each combination is an average over joints. For example, the chest is a 3D vector whose first value is the average of (ShoulderCenter_x, Spine_x, HipCenter_x), whose second value is the average of (ShoulderCenter_y, Spine_y, HipCenter_y), and so on.
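A minimal sketch of such a joint combination, under the assumption that each frame is stored as a dictionary of 3D joint vectors (the joint names follow the Kinect SDK; the exact grouping into combinations is the one defined in Figure 4.11):

    import numpy as np

    def joint_combination(frame, joint_names):
        # Component-wise average of the selected joints' 3D vectors,
        # e.g. chest = mean(ShoulderCenter, Spine, HipCenter).
        return np.mean([frame[name] for name in joint_names], axis=0)

    # Hypothetical frame data (rotation or position vectors per joint).
    frame = {
        "ShoulderCenter": np.array([0.02, 0.10, 0.01]),
        "Spine":          np.array([0.01, 0.08, 0.02]),
        "HipCenter":      np.array([0.00, 0.05, 0.03]),
    }
    chest = joint_combination(frame, ["ShoulderCenter", "Spine", "HipCenter"])
    print(chest)  # the averaged chest vector, one value per axis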
The third group is named Posture Movement Group and combines movement features; it contains 36 features. Table 4.4 shows the elements of the group.
For simplicity, the three groups are combined in this work into one group called the Meta-Feature Group, but when the performance is evaluated, the results of each individual group are also shown.
Figure 4.11: Joint Combinations; skeleton taken from [1].
Feature Name | Parameters
Average Change of Rate | Joint combinations
Relative Movement | Joint combinations
Smooth-Jerk | Joint combinations

Table 4.3: Limb Rotation Movement Group definition.
Feature Name | Parameters
Average Change of Rate | Posture Group Elements
Relative Movement | Posture Group Elements
Smooth-Jerk | Posture Group Elements

Table 4.4: Posture Movement Group definition.
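As a quick sanity check on the stated group sizes (assuming the six joint combinations of Figure 4.11, visible as LA, RA, H, C, LL and RL in the later result figures, and the twelve Posture Group elements of Table 4.2), the counts of 18 and 36 features follow directly from crossing the three movement features with their inputs:

    movement_features = ["AverageChangeOfRate", "RelativeMovement", "SmoothJerk"]
    joint_combinations = ["LeftArm", "RightArm", "Head", "Chest", "LeftLeg", "RightLeg"]
    posture_elements = ["P0", "P1", "P2", "P3", "P4", "P5",
                        "H1", "H2", "H3", "L1", "BL", "OP"]

    limb_rotation_movement = [(f, c) for f in movement_features for c in joint_combinations]
    posture_movement = [(f, p) for f in movement_features for p in posture_elements]

    print(len(limb_rotation_movement))  # 18, the size of the Limb Rotation Movement Group
    print(len(posture_movement))        # 36, the size of the Posture Movement Group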
4.3 Data Classification
To classify the emotion sequences, supervised learning with an SVM (the SMO implementation) was used. The SVM was chosen because it performs well on data sets with many attributes, even when there are very few cases to train the model on, and it is also often used for sequence classification. The SVM is implemented in the data mining software Weka; the preprocessed emotional data was therefore transformed into a Weka file.
The SVM input is illustrated with the example below. Suppose we want to classify emotional sequences using raw data, i.e. the hierarchical rotations (HR) of all 16 joints, and assume that we have:
• 20 emotion sequences
• a longest sequence of 100 frames
Then the SVM input is 20 vectors of length 100 × 16 × 3 = 4800, where 100 is the number of frames, 16 is the number of joints and 3 stands for the roll, pitch and yaw angles of each joint.
The input for every emotional sequence looks as follows:
{HipCenterX1, HipCenterY1, HipCenterZ1, SpineX1, SpineY1, SpineZ1, ...,
AnkleRightX100, AnkleRightY100, AnkleRightZ100}
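As a minimal sketch of this flattening (not the exporter actually used to produce the Weka file), assuming the per-sequence data is available as an array of shape (frames, joints, angles):

    import numpy as np

    N_FRAMES, N_JOINTS, N_ANGLES = 100, 16, 3   # frames, joints, (roll, pitch, yaw)

    def flatten_sequence(sequence):
        # sequence has shape (frames, joints, 3); the result is one row vector of
        # length frames * joints * 3, ordered frame by frame and joint by joint
        # (HipCenter first, AnkleRight last), matching the layout shown above.
        assert sequence.shape == (N_FRAMES, N_JOINTS, N_ANGLES)
        return sequence.reshape(-1)

    # 20 stand-in sequences -> a 20 x 4800 input matrix for the classifier.
    sequences = np.random.rand(20, N_FRAMES, N_JOINTS, N_ANGLES)
    X = np.stack([flatten_sequence(s) for s in sequences])
    print(X.shape)  # (20, 4800)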
5 Experiments and Analysis
This chapter starts with an analysis of the recorded emotion sequences, illustrated with several examples. As stated in chapter 4, to classify emotion sequences using Support Vector Machines (in the following called SVM), their lengths need to be equalized. In this chapter the performance of two length equalization methods, Dynamic Time Warping (in the following called DTW) and Simple Video Stretching (in the following called SVS), is evaluated and the better one is chosen. The next step is finding the best parameter set for the SVM. After the best SVM parameters are fixed, each individual element of the input (raw data and meta-features) is analysed. The following step tests the classification accuracy obtained by selecting joint and feature combinations. The identification results for each individual emotion are also shown, as is the classification accuracy when none of a person's emotional data is part of the training set. The last step is an observation of the raw data.
5.1 Recorded Data — Observation/Evaluation
The recordings were made at 15 frames per second due to hardware limitations: since the Kinect Toolbox records the streams in raw format, the computer could not keep up with the system I/O.

Multiple video response tests showed that people can feel different emotions when watching the same video; a set of videos that triggers the same emotions in every person was not found. Observations showed that female participants expressed their emotions more clearly, and the emotion predicted for a specific video was usually triggered, while male participants mostly expressed emotions different from the predicted ones. The primary idea was to trigger anger, joy, sadness, fear, disgust and surprise, but after processing all the recordings it became clear that not all of the basic emotions were triggered in every person.
The response to the turtle game was also different from what was expected. In the beginning most people reacted with confusion, as they did not understand what was going on; then they were mostly interested. For some participants, frustration, irritation and annoyance were observed for a while. Most participants did not complete the game; only two did. All female participants gave up, saying "Stop" after about a minute of game time.

As predicted, the same emotion was expressed differently even by the same person; the expressions differ in the participants' movements and their duration. Even though the expressions of emotions were different, common characteristics could be observed. For example, when disgusted, participants turned away from the TV and kept a closed body posture.
In the remainder of this section each emotion is shown twice, displaying every fifth frame of the emotional sequence. The first emotion shown is boredom (Figure 5.1). This emotion was triggered by the same video in two different participants; their starting positions differ, but their body movements have similarities.

The expression of joy is shown in Figure 5.2. The expressions of joy from the two participants differ in their movements and duration. As can be seen in Figure 5.2, participant (a) expresses the emotion more intensely than participant (b).

Different expressions of curiosity are shown in Figure 5.3. It was observed that most participants, when expressing curiosity, were bending towards the visual input.

The reactions of two participants to the first seconds of the game are shown in Figure 5.4; their body movements are completely different.

The last emotion is disgust. The reactions of one person are shown in Figure 5.5. The lengths of the two sequences are different, but both contain the same characteristics, for instance turning the head away.

A closer inspection shows that boredom and disgust share some attributes, for example turning the head down. It is also possible to find many similarities in body movements across all of the emotions.
Figure 5.1: Boredom expressed by participants (a) and (b). Only every fifth frame from the sequence is shown.
Figure 5.2: Joy expressed by participants (a) and (b). Only every fifth frame from the sequence is shown.
Figure 5.3: Curiosity expressed by participants (a) and (b). Only every fifth frame from the sequence is shown.
Figure 5.4: Confusion expressed by participants (a) and (b). Only every fifth frame from the sequence is shown.
Figure 5.5: Disgust expressed, sequences (a) and (b). Only every fifth frame from the sequence is shown.
5.2 Dynamic Time Warping versus Simple Video Stretching
Figure 5.6: Video length equalizing method evaluation. The input dataset is fed into DTW (with its parameters optimized) and into SVS, both outputs are classified with the SVM, and the performances are compared to choose the better method. In the diagram a rhombus represents a junction.
The purpose of this experiment is to evaluate the performance of the two methods that make the video sequences the same length. For DTW, different input parameters are also evaluated. The test inputs are raw data (joint Hierarchical Rotation angles) and meta-features (in the following called MF). The MF window¹ size was set to 20 %. The window size was chosen empirically, taking into consideration the lengths of the shortest and the longest emotion sequence. For detailed information about the window see section 4.2.2. The sequence of actions for testing is shown in Figure 5.6. Firstly, the inputs are extracted from the emotion dataset and fed into DTW and SVS. Then the results are given to the SVM with the default Weka parameters². The overall classification accuracy using 10-fold cross-validation is used for the comparison between the methods and the DTW parameters. For DTW, the Manhattan, Euclidean and Squared Euclidean distances were tested, and the most successful distance was then fixed. Afterwards, the DTW performance with a Sakoe-Chiba band was tested with band widths ranging from 1 to 50. For DTW the NDTW library [28] was used. The used implementation also offers an Itakura parallelogram constraint, but it did not work for the input data and the reason for that was not found.
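To illustrate the two length-equalization approaches, the following is only a sketch for 1-D series, not the NDTW library or the implementation used in this work; the Euclidean local cost and the band handling are assumptions:

    import numpy as np

    def dtw_cost_matrix(a, b, band=None):
        # Classical DTW between 1-D series a and b; an optional Sakoe-Chiba band
        # restricts the warping path to |i - j| <= band.
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            lo, hi = 1, m
            if band is not None:
                lo, hi = max(1, i - band), min(m, i + band)
            for j in range(lo, hi + 1):
                cost = abs(a[i - 1] - b[j - 1])   # local cost (Euclidean in 1-D)
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[1:, 1:]                           # bottom-right cell = DTW distance

    def simple_video_stretch(a, target_len):
        # SVS: linearly resample a series to target_len frames.
        old_x = np.linspace(0.0, 1.0, num=len(a))
        new_x = np.linspace(0.0, 1.0, num=target_len)
        return np.interp(new_x, old_x, a)

    series_a = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0])
    series_b = np.array([0.0, 2.0, 3.0, 1.0])
    print(dtw_cost_matrix(series_a, series_b, band=2)[-1, -1])
    print(simple_video_stretch(series_b, len(series_a)))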
Distance | Accuracy (%), Raw Data | Accuracy (%), DTW -> MF | Accuracy (%), MF -> DTW
Euclidean | 40.11 | 48.66 | 40.64
Manhattan | 43.85 | 46.13 | 41.18
Squared Euclidean | 35.83 | 42.79 | 39.58

Table 5.1: DTW distance test results.
¹ Number of frames.
² C = 1, ε = 10⁻¹², Polynomial kernel (exponent = 1).
Sakoe-Chiba band | Accuracy (%), Raw Data | Accuracy (%), DTW -> MF | Accuracy (%), MF -> DTW
1 | 40.10 | 44.92 | 41.18
5 | 41.18 | 45.45 | 41.18
10 | 41.71 | 44.92 | 41.18
50 | 42.78 | 49.20 | 41.18

Table 5.2: DTW Sakoe-Chiba band test results.
5.2.1 Raw Data Result Discussion
The DTW parameter test results are shown in Table 5.1 and Table 5.2. The performance of SVS is 43.32 %. The results show that there is little difference between the methods and the parameters. With the classical DTW (with no constraints) using the Manhattan distance the performance was slightly higher, so this DTW configuration was selected for further processing.
5.2.2 Meta-feature Result Discussion
Two possibilities for the meta-features were tested:
• Perform the length equalization on the raw data, then extract the meta-features.
• Extract the meta-features first and then perform the length equalization.
The two DTW test results are shown in Table 5.1 and Table 5.2. The performance of SVS is 41.05 % for the equalize-then-extract case (DTW -> MF) and 34.71 % for the extract-then-equalize case (MF -> DTW). The results show that the best approach is to first perform DTW on the raw data and then, having video sequences of equal length, extract the meta-features. The DTW parameter results are again not significantly different. The best performance was achieved with the Euclidean distance and a Sakoe-Chiba band constraint of 50; these settings were used for further processing.
A problem may arise when choosing to first perform DTW and then extract the meta-features: because short sequences are stretched, many zeros appear in the meta-features that measure changes in body movements and postures. This could lead to the classifier merely separating short and long sequences. Table 5.3, however, shows that the variance of the emotion sequence length is high within each emotion class, which means that short and long sequences are present in all classes.
Class | Variance | Sd | Minimum | Maximum | Mean
Curiosity | 803.58 | 28.35 | 10 | 145 | 37.80
Joy | 481.19 | 21.94 | 13 | 129 | 58.39
Confusion | 768.24 | 27.72 | 18 | 142 | 46.65
Disgust | 914.38 | 30.24 | 10 | 151 | 52.15
Boredom | 506.36 | 22.50 | 46 | 133 | 78.12

Table 5.3: Emotional sequence length evaluation. Sd is short for Standard Deviation. Values represent numbers of frames.
5.3 Support Vector Machines Parameter Selection
Figure 5.7: SVM parameter selection flow. The 16-joint HR data is length-equalized with DTW; 10 training and testing sets are created, and for each set the best SVM parameter set is found and evaluated on the testing set. After all 10 sets are processed, the best parameter set is selected.
For each problem the correct SVM parameters need to be selected in order to achieve the best possible result. Therefore in this section different SVM parameters are tested and their results presented. The SVM inputs are:
• Raw data after stretching using classical DTW with the Manhattan distance.
• Meta-features, extracted after equalizing the raw data with DTW (Euclidean distance, Sakoe-Chiba band = 50).
The input dataset was randomly divided into training and test sets, 80 % for training and 20 % for testing, with no data repetitions and an equal distribution of emotions; this was performed 10 times. To find the best set of parameters, the Weka parameter-optimizing meta-classifier GridSearch was used. Two SVM kernels were chosen: the Polynomial kernel and the RBF kernel. The SVM model complexity parameter C was searched in the interval 1 to 30, and ε was set constant to 10⁻¹². GridSearch compared the overall classification accuracy. For the Polynomial kernel, the exponent was searched in the range 1 to 10; for the RBF kernel, γ was searched in the range 10⁻⁵ to 10⁻². The grid was set to be extendible. For detailed information about SVM, see section 3.2.2.
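The search itself is done with Weka's GridSearch meta-classifier; as a rough stand-in sketch only (scikit-learn instead of Weka, random stand-in data, and Weka's ε parameter has no direct equivalent here), a comparable search over C and the polynomial exponent could look as follows:

    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    # Stand-in data: 40 flattened sequences with 4800 attributes, 5 emotion classes.
    X = np.random.rand(40, 4800)
    y = np.repeat(np.arange(5), 8)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Grid over the model complexity C (1..30) and the polynomial exponent (1..10).
    param_grid = {"C": list(range(1, 31)), "degree": list(range(1, 11))}
    search = GridSearchCV(SVC(kernel="poly"), param_grid, scoring="accuracy", cv=3)
    search.fit(X_train, y_train)

    print(search.best_params_)            # best C and exponent found on the training set
    print(search.score(X_test, y_test))   # accuracy on the held-out 20 %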
5.3.1 Raw Data Result Discussion
The Polynomial kernel results are shown in Figure 5.8 and the RBF kernel results in Figure 5.9. The performance differs from one training/test set to another: the sd³ of the accuracy for the Polynomial kernel is 6.76 with a mean accuracy of 44.96%, while for the RBF kernel the sd of the accuracy is 8.62 with a mean accuracy of 43.85%.

The reason why the results are so close could be the lack of data and/or that both kernels are suitable for the emotion recognition problem. A slightly higher performance is achieved with the Polynomial kernel, and the most common parameter selection is C = 23, exponent = 4. These SVM parameters (C = 23, Polynomial kernel with exponent = 4) are used for further processing.
³ Standard Deviation.
Figure 5.8: Polynomial kernel optimal parameter search results. For each of the 10 folds the plot shows the selected C and exponent values together with the achieved accuracy (%).
Figure 5.9: RBF kernel optimal parameter search results. For each of the 10 folds the plot shows the selected C and γ values together with the achieved accuracy (%); γ is 10⁻⁴ in nine out of ten folds and 10⁻³ in fold number three.
5.3.2 Meta-Feature Result Discussion
The results of the SVM parameter search for the meta-features are shown in Table 5.4. As with the raw data results, there is little difference between the kernels. The sd for the RBF kernel is smaller than for the Polynomial kernel, so the RBF kernel is selected for further processing. The selected parameter values are C = 18 and γ = 10⁻⁴. As shown in Figure 5.10, the variance of the C parameter is high; therefore an average value of C = 18 has been chosen.
Kernel | Fold 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | mean | sd
RBF | 25.00 | 36.11 | 36.11 | 38.88 | 35.13 | 27.77 | 33.33 | 29.73 | 27.03 | 24.32 | 31.34 | 5.21
Polynomial | 33.33 | 36.11 | 27.78 | 30.55 | 37.84 | 25.00 | 41.67 | 27.03 | 24.33 | 27.03 | 31.52 | 6.09

Table 5.4: Meta-feature SVM optimal parameter search results (accuracy in %).
Figure 5.10: Meta-feature RBF kernel optimal parameter search results. For each of the 10 folds the plot shows the selected C value together with the achieved accuracy (%); γ is 10⁻⁴.
5.4 Results
This section shows the results for raw data and meta-features using the previously selected parameters for video length equalization and for the SVM. The overall classification accuracy from 10-fold cross-validation on the entire emotional dataset is used for comparison. Results for each individual input element (specific joint, specific feature) are also shown. Later in the section the search for an optimal combination of joints and features is presented, followed by the performance of the combined input (raw data plus meta-features). At the end of the section a test with an unknown person (not part of the learning set) is performed.
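The comparisons below are based on the 10-fold cross-validation accuracy over the whole dataset. A minimal sketch of that evaluation (scikit-learn as a stand-in for Weka, with random stand-in data and the raw-data parameters selected above):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X = np.random.rand(50, 4800)     # stand-in flattened emotion sequences
    y = np.repeat(np.arange(5), 10)  # stand-in labels, 10 sequences per emotion

    clf = SVC(kernel="poly", degree=4, C=23)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(scores.mean() * 100)       # overall classification accuracy in %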
Raw Data Results
An interesting point is to test the performance of each individual joint. The results in Figure 5.11 show that the highest classification rate is achieved by combining all 16 joints. The worst classification accuracy is obtained using only the HipLeft joint, suggesting that this joint moved the least, or moved in the same way, across the emotions.
Figure 5.11: Individual joint classification results (classification accuracy in %). For joint labels see Figure 3.2. HC 31.55, S 31.02, SC 29.41, H 29.41, SL 28.34, EL 31.02, HL 21.93, SR 25.67, ER 21.93, HR 24.06, HipL 15.51, KL 29.95, AL 29.41, HipR 26.20, KR 32.62, AR 25.13, All 43.32.
Meta-feature Result
Each group of features defined in section 4.2.2 is evaluated. The Posture Group results are shown in Figure 5.12. The mean value is 23.57% with sd = 3.02, meaning that Body Lean and the combination of all Posture Group features perform better than the other features.

The results of the Limb Rotation Movement Group are shown in Figure 5.13. The mean is 24.24% and the sd is 2.98. The performance of AverageChangeOfRate(LeftArm/RightArm) is below this mean, while the performance of RelativeMovement(All) and SmoothJerk(All) is above it, meaning that using them gives a better classification accuracy. The combined result of the Limb Rotation Movement Group is 40.11%.

The results of the last group, the Posture Movement Group, are shown in Figure 5.14. They show that the classification rate of each individual feature is low. The mean value is 22.69% with an sd of 2.43. The performance of RelativeMovement(RightArm/LeftLeg/All) and SmoothJerk(Chest/All) is higher than that of the other features. The results also show that combining all features does not always lead to a better classification rate, for instance AverageChangeOfRate(All). The combined result of the Posture Movement Group is 42.78%.

Comparing these results to the work of Garber-Barron and Si [16], the performance of the Posture Movement Group was also slightly higher than that of the other groups in their work. The results show that it is better to look at how a specific feature changes over a time interval than at static per-frame postures (pose symmetry, head relationship to the hips, etc.).
Figure 5.12: Individual feature from Posture Group classification results (classification accuracy in %). For feature labels see Table 4.2. P0 24.60, P1 22.99, P2 21.93, P3 21.39, P4 21.39, P5 22.99, H1 22.99, H2 20.86, H3 21.39, L1 22.46, BL 27.81, OP 24.06, All 31.55.
Figure 5.13: Individual feature from Limb Rotation Movement Group classification results (classification accuracy in %). For feature labels see Figure 4.11. Avg Change of Rate: LA 20.86, RA 20.86, H 21.93, C 22.46, LL 22.99, RL 24.60, All 21.39. Relative Movement: LA 24.60, RA 26.20, H 22.99, C 24.06, LL 26.74, RL 25.67, All 28.88. Smooth Jerk: LA 25.13, RA 22.46, H 21.93, C 26.20, LL 24.60, RL 21.39, All 33.16.
Figure 5.14: Individual feature from Posture Movement Group classification results (classification accuracy in %). For feature labels see Table 4.2. Avg Change of Rate: P0 21.93, P1 22.46, P2 21.93, P3 22.99, P4 19.79, P5 22.99, H1 20.86, H2 24.06, H3 22.46, L1 20.86, BL 22.99, OP 22.99, All 24.06. Relative Movement: P0 22.99, P1 20.32, P2 20.86, P3 22.99, P4 21.39, P5 21.93, H1 23.53, H2 21.39, H3 21.39, L1 22.46, BL 22.99, OP 21.93, All 24.60. Smooth Jerk: P0 22.99, P1 24.06, P2 20.32, P3 19.79, P4 22.99, P5 19.79, H1 24.60, H2 22.99, H3 21.39, L1 25.13, BL 24.06, OP 22.99, All 34.76.
Joint and Feature Selection Result
This experiment was performed to test whether there is a specific combination of joints/features that performs better than taking all joints/features. To test this, the emotional dataset was randomly divided into training and testing sets five times. For each set, a specific combination of joints/features was selected randomly 1000/1500 times; each combination was fed into the SVM and tested on the testing set. A minimal sketch of this selection procedure is shown below.
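The sketch below only illustrates the general random subset search and is not the code used in this work; scikit-learn replaces Weka's SMO, and the data shapes, the split and the attribute-to-joint mapping are stand-ins:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Stand-in data: 40 sequences, 16 joints, 300 attributes per joint, 5 classes.
    N_SEQ, N_JOINTS, ATTRS_PER_JOINT = 40, 16, 300
    X = rng.random((N_SEQ, N_JOINTS * ATTRS_PER_JOINT))
    y = np.repeat(np.arange(5), N_SEQ // 5)

    def columns_for(joints):
        # Column indices of the flattened vectors that belong to the given joints.
        return np.concatenate([np.arange(j * ATTRS_PER_JOINT, (j + 1) * ATTRS_PER_JOINT)
                               for j in joints])

    # One random train/test split (the thesis repeats this five times).
    idx = rng.permutation(N_SEQ)
    train, test = idx[:32], idx[32:]

    best_acc, best_joints = 0.0, None
    for _ in range(1000):                       # 1000 random joint combinations
        k = int(rng.integers(2, N_JOINTS + 1))  # random subset size
        joints = rng.choice(N_JOINTS, size=k, replace=False)
        cols = columns_for(joints)
        clf = SVC(kernel="poly", degree=4, C=23).fit(X[train][:, cols], y[train])
        acc = clf.score(X[test][:, cols], y[test])
        if acc > best_acc:
            best_acc, best_joints = acc, sorted(joints.tolist())

    print(best_acc, best_joints)                # the "winning" joint selection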
The results for the joints are shown in Table 5.5; the mean classification accuracy is 35.27% with an sd of 11.84. The high sd indicates that there is not enough data in the emotional dataset or that the data is too varied. All "winning" joint selections were tested on the entire dataset using 10-fold cross-validation; the results show that none of the found joint selections is better than using all 16 joints, and the results are not significantly different.
Fold | Joint selection | Accuracy (%), testing set | Accuracy (%), entire set
1 | HC, S, SC, SL, HL, ER, HR, AR | 27.78 | 37.43
2 | HC, S, SC, EL, SR, ER, HR, AL, HipR, KR | 44.44 | 40.64
3 | HC, SC, Head, SL, EL, HL, SR, ER, AL, KR | 25.00 | 43.31
4 | S, SC, HL, SR, ER, HL, AL, HipR, KR | 27.78 | 40.64
5 | HC, SC, H, SR, HipL, KL, AL, KR, AR | 51.35 | 42.78
mean | | 35.27 | 40.96
sd | | 11.84 | 2.32

Table 5.5: Joint selection results. For joint labels see Figure 3.2.
The results for the different meta-feature selections are shown in Table 5.6; the mean accuracy is 32.00% and the sd is 5.99. The sd is smaller than for the joints, but the mean classification accuracy is also smaller. The selected features were tested on the entire dataset, and the results indicate that a smaller number of features can produce the same, or not significantly different, classification accuracy as using all features. Using fewer features could also speed up the computation. The results further show that some features are selected every time, for example L1 from the Posture Group, which suggests that this is a strong feature for emotion recognition.
Combined Result
Figure 5.15 shows the overall classification accuracy for the raw data, the meta-features and the combined data (raw data plus meta-features). The highest classification accuracy is achieved with the combined data, the lowest with the raw data.
The confusion matrix of the combined result is shown in Figure 5.16, and the precision and recall for each emotion are shown in Figure 5.17. Disgust and boredom are the most correctly recognised emotions.
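The per-emotion precision and recall of Figure 5.17 can be read directly off a confusion matrix such as the one in Figure 5.16. A generic sketch of that computation follows; the matrix values below are placeholders, not the thesis' actual results:

    import numpy as np

    # Placeholder 5x5 confusion matrix: rows are true emotions, columns predictions.
    cm = np.array([
        [10,  2,  3,  1,  1],
        [ 2,  9,  2,  2,  1],
        [ 3,  2,  8,  2,  2],
        [ 1,  1,  2, 12,  1],
        [ 1,  1,  1,  1, 13],
    ])

    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # TP / (TP + FP) per emotion (column-wise)
    recall    = tp / cm.sum(axis=1)   # TP / (TP + FN) per emotion (row-wise)
    print(precision.round(2))
    print(recall.round(2))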
5.4.RESULTS 58
Fold 1: accuracy 33.33 % on the testing set, 44.39 % on the entire set.
Selected features: AverageChangeOfRate(P1, P4, H3, BL; H, LL); RelativeMovement(P0, P4, P5, H3, BL; RA); SmoothJerk(P0, P2, P3, L1, BL; LAz, RAx, RAz, Hy, Cx, Cy); BL, P2, H2, L1.

Fold 2: accuracy 27.78 % on the testing set, 47.06 % on the entire set.
Selected features: AverageChangeOfRate(P4, H1, H3, OP, RL); RelativeMovement(P3, H1, L1, BL, OP; H, LL); SmoothJerk(P0, P3, P4, L1, BL; LAx, LAy, LAz, RAx, Hy, Cy, Cz, LLz, RLx, RLy); BL, OP, P5, L1.

Fold 3: accuracy 33.33 % on the testing set, 45.45 % on the entire set.
Selected features: AverageChangeOfRate(P0, P4, L1; C); RelativeMovement(P4, P5, H1, H3, BL, OP; RA, C, LL, RL); SmoothJerk(P0, P1, H1, L1, BL; LAy, Hy, Cy, LLx, LLz, RLx, RLz); OP, P0, P1, P5, L1.

Fold 4: accuracy 25.00 % on the testing set, 48.67 % on the entire set.
Selected features: AverageChangeOfRate(P0, P1, P4, P5, H1, H2, OP; LA, LL); RelativeMovement(P0, P1, H2, L1, BL; RA, LL); SmoothJerk(P0, P1, P2, P3, P5, H1, H2, L1; LAx, LAz, RAx, Hy, Cy, RLx); BL, P5, L1.

Fold 5: accuracy 40.54 % on the testing set, 50.80 % on the entire set.
Selected features: AverageChangeOfRate(P0, P3, P4, H1, L1, BL, OP; H, LL); RelativeMovement(P0, P1, P3, P5, H1, H2, BL, OP; LA, RA, H); SmoothJerk(P0, P1, P2, P5, H1, H2, BL; LAx, LAy, RAx, Hy, Hz, Cy, Cz, RLx); BL, P0, P1, P3, P5, H3, L1.

Mean accuracy: 32.00 % on the testing set, 47.27 % on the entire set. Sd: 5.99 and 2.55 respectively.

Table 5.6: Meta-feature selection results. For feature labels see Table 4.2 and Figure 4.11.
Figure 5.15: Overall results (classification accuracy in %): Raw Data 43.32, Meta-Features 48.66, Combined 55.62.
Figure 5.16: Combined result confusion matrix.
Figure 5.17: Emotion recognition result. Precision and recall are shown per emotion (Curious, Joy, Confusion, Boredom, Disgust).
Emotion Identification Results
In this subsection, observations of the effectiveness of the meta-features (each individual group) and of the raw data are presented to see which