Automatic Analysis of Facial Expressions: The State of the Art

By Maja Pantic, Leon Rothkrantz

Presentation Outline


Motivation


Desired functionality and evaluation criteria


Face Detection


Expression data extraction


Classification


Conclusions and future research

Motivation


HCI


Hope to achieve robust communication by recovering from
failure of one communication channel using information
from another channel


According to some estimates, the facial expression of the
speaker accounts for 55% of the effect of the spoken message
(with the voice intonation contributing 38%, and the verbal
part just 7%)


Behavioral science research


Automation of objective measurement of facial activity


Desired Functionality


Human visual system = good reference point


Desired properties:


Works on images of people of any sex, age, and ethnicity


Robust to variation in lighting


Insensitive to hair style changes, presence of glasses, facial
hair, partial occlusions


Can deal with rigid head motions


Is real-time


Capable of classifying expressions into multiple emotion
categories


Able to learn the range of emotional expression by a particular
person


Able to distinguish all possible facial expressions (probably
impossible)

Overview


Three basic problems need to be solved:


Face detection


Facial expression data extraction


Facial expression classification


Both static images and image sequences have
been used in studies surveyed in the paper

Face Detection


In arbitrary images


A. Pentland et al.


Detection in a single image


Principal Component Analysis is used to generate a face space
from a set of sample images


A face map is created by calculating the distance between the local
subimage and the face space at every location in the image


If the distance is smaller than a certain threshold, the presence of a
face is declared


Detection in an image sequence


Frame differencing is used


The difference image is thresholded to obtain motion blobs


Blobs are tracked and analyzed over time to determine if motion is
caused by a person and to determine the head position
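The two detection ideas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the surveyed authors' implementation: the function names, the patch size, the number of components, and both thresholds are assumptions chosen for the example.

```python
import numpy as np

def fit_face_space(samples, n_components):
    """samples: (n_samples, h*w) flattened face patches. Returns (mean, basis)."""
    mean = samples.mean(axis=0)
    # Rows of vt are the principal directions of the centered sample set.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:n_components]

def distance_from_face_space(patch, mean, basis):
    """Norm of the part of the patch that the face space cannot reconstruct."""
    centered = patch.ravel() - mean
    reconstruction = basis.T @ (basis @ centered)
    return float(np.linalg.norm(centered - reconstruction))

def face_map(image, mean, basis, patch_shape):
    """Distance from face space for every patch position in the image."""
    ph, pw = patch_shape
    rows = image.shape[0] - ph + 1
    cols = image.shape[1] - pw + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = image[r:r + ph, c:c + pw]
            out[r, c] = distance_from_face_space(patch, mean, basis)
    return out

def motion_mask(prev_frame, frame, threshold):
    """Frame differencing: True where the absolute intensity change exceeds threshold."""
    diff = np.abs(frame.astype(float) - prev_frame.astype(float))
    return diff > threshold
```

A face would be declared wherever `face_map(...)` falls below a chosen threshold; grouping the `True` pixels of `motion_mask(...)` into blobs and tracking them is left out of this sketch.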



Face Detection (Continued)


In face images


Holistic approaches (the face is detected as a whole unit)


M. Pantic, L. Rothkrantz


Use a frontal and a profile face image


Outer head boundaries are determined by analyzing the horizontal and
vertical histograms of the frontal face image


The face contour is obtained by using an HSV color model based
algorithm (the face is extracted as the biggest object in the scene having
the Hue parameter in the defined range)


The profile contour is determined by following the procedure below:


The value component of the HSV color model is used to threshold the
input image


The number of background pixels between the right edge of the image and
the first “On” pixel is counted (this gives a vector that represents a discrete
approximation of the contour curve)


Noise is removed by averaging


Local extrema correspond to points of interest (found by determining zero
crossings of the 1st derivative)
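The profile-contour procedure above can be sketched as follows. This is a rough illustration under stated assumptions: the input is taken to be the Value channel as a 2-D array, the smoothing window of 3 is arbitrary, and the function names are hypothetical.

```python
import numpy as np

def profile_contour(value_channel, threshold):
    """For each row, count background pixels from the right edge to the first 'on' pixel."""
    on = value_channel > threshold
    flipped = on[:, ::-1]                  # scan each row starting at the right edge
    first_on = flipped.argmax(axis=1)      # index of the first True (0 if the row has none)
    width = on.shape[1]
    # Rows without any 'on' pixel get the full image width.
    return np.where(flipped.any(axis=1), first_on, width).astype(float)

def smooth(contour, window=3):
    """Noise removal by moving average (the window size is an assumption)."""
    kernel = np.ones(window) / window
    return np.convolve(contour, kernel, mode="valid")

def local_extrema(contour):
    """Indices where the first difference changes sign (zero crossings of the derivative)."""
    d = np.diff(contour)
    sign_change = d[:-1] * d[1:] < 0       # flat plateaus (d == 0) are not reported
    return np.nonzero(sign_change)[0] + 1
```

The indices returned by `local_extrema` are the candidate points of interest along the profile.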





Face Detection (Continued)



Analytic approaches (the face is detected by
detecting some important facial features first)


H. Kobayashi, F. Hara


Brightness distribution data of the human face is obtained with
a camera in monochrome mode


An average of brightness distribution data obtained from 10
subjects is calculated


Irises are identified by computing crosscorrelation between the
average image and the novel image


The locations of other features are determined using relative
locations of the facial features in the face
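The cross-correlation step above can be illustrated with a brute-force normalized cross-correlation. The sketch is a stand-in only: the `template` argument plays the role of the averaged brightness data, and the normalization and function names are assumptions rather than details taken from the surveyed work.

```python
import numpy as np

def normalized_cross_correlation(image, template):
    """Score in [-1, 1] for every position where the template fits inside the image."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.linalg.norm(t)
    rows = image.shape[0] - th + 1
    cols = image.shape[1] - tw + 1
    scores = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            w = image[r:r + th, c:c + tw]
            w = w - w.mean()
            denom = np.linalg.norm(w) * t_norm
            if denom > 0:                  # flat windows keep a score of 0
                scores[r, c] = float((w * t).sum() / denom)
    return scores

def best_match(image, template):
    """Top-left (row, col) of the window that correlates best with the template."""
    scores = normalized_cross_correlation(image, template)
    r, c = np.unravel_index(scores.argmax(), scores.shape)
    return int(r), int(c)
```

Once the best match is found, the remaining features would be searched for at fixed offsets relative to it.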



Template-based facial expression data extraction using static images


Edwards et al.


Use Active Appearance Models (AAMs)


Combined model of shape and gray-level appearance


A training set of hand-labeled images with landmark
points marked at key positions to outline the main
features


PCA is applied to shape and gray level data separately,
then applied again to a vector of concatenated shape and
gray level parameters


The result is a description in terms of “appearance”
parameters


80 appearance parameters sufficient to explain 98% of
the variation in the 400 training images labeled with 122
points


Given a new face image, they find appearance parameter
values that minimize the error between the new image
and the synthesized AAM image
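The "PCA, then PCA again" construction above can be sketched like this. It is a simplified illustration: shapes and gray-level samples are assumed to be already aligned and flattened into rows, `shape_weight` is a hypothetical scaling factor between the two kinds of parameters, and the iterative search that fits the model to a new image is not shown.

```python
import numpy as np

def pca(data, variance_kept):
    """Return the mean and the fewest principal directions explaining variance_kept."""
    mean = data.mean(axis=0)
    _, s, vt = np.linalg.svd(data - mean, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, variance_kept) + 1)
    return mean, vt[:k]

def project(data, mean, basis):
    """Coordinates of each row of data in the given basis."""
    return (data - mean) @ basis.T

def appearance_parameters(shapes, textures, variance_kept=0.98, shape_weight=1.0):
    """PCA on shape and gray-level data separately, then on the concatenated parameters."""
    s_mean, s_basis = pca(shapes, variance_kept)
    g_mean, g_basis = pca(textures, variance_kept)
    b_s = project(shapes, s_mean, s_basis)
    b_g = project(textures, g_mean, g_basis)
    combined = np.hstack([shape_weight * b_s, b_g])
    c_mean, c_basis = pca(combined, variance_kept)
    return project(combined, c_mean, c_basis)
```

Each row of the result is the "appearance" parameter vector of one training image.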


Feature-based facial expression data extraction using static images


M. Pantic, L. Rothkrantz


A point-based face model is used


19 points selected in the frontal-view image, and 10 in the
side-view image


Face model features are defined as some geometric
relationship between facial points or the image intensity
in a small region defined relative to facial points (e.g.
Feature 17 = Distance KL)


Neutral facial expression analyzed first


The positions of facial points are determined by using
information from feature detectors


Multiple feature detectors are used for each facial feature
localization and model feature extraction


The result obtained from each detector is stored in a
separate file


The detector output is checked for accuracy


After “inaccurate” results are discarded, those that were
obtained by the highest priority detector are selected for
use in the classification stage
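The selection logic above (discard inaccurate results, then keep the one from the highest-priority detector) can be sketched as below. The record layout, the convention that a larger number means higher priority, and the function names are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class DetectorResult:
    detector: str                 # hypothetical detector name
    priority: int                 # larger means more trusted (an assumption)
    point: Tuple[float, float]    # located facial point (x, y)
    accurate: bool                # outcome of the accuracy check

def select_result(results: Sequence[DetectorResult]) -> Optional[DetectorResult]:
    """Drop results that failed the accuracy check, then keep the highest-priority one."""
    accurate = [r for r in results if r.accurate]
    if not accurate:
        return None
    return max(accurate, key=lambda r: r.priority)

def point_distance(p: Tuple[float, float], q: Tuple[float, float]) -> float:
    """A distance feature between two facial points, e.g. the distance KL."""
    return math.hypot(p[0] - q[0], p[1] - q[1])
```

A feature such as "Feature 17 = Distance KL" would then be computed with `point_distance` on the selected locations of points K and L.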

Template-based facial expression data extraction using image sequences


M. Black, Y. Yacoob


Do not address the problem of initially locating the
various facial features


The motion of various face regions is estimated
using parameterized optical flow


Estimates of deformation and motion parameters
(e.g. horizontal and vertical translation, divergence,
curl) are derived
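How translation, divergence, and curl fall out of a parameterized flow model can be shown with an affine model, u = a0 + a1*x + a2*y and v = a3 + a4*x + a5*y. The sketch below fits that model to an already computed flow field by ordinary least squares, which is a simplification chosen for brevity; the estimation procedure and the exact models used in the surveyed work may differ.

```python
import numpy as np

def fit_affine_flow(x, y, u, v):
    """Least-squares fit of u = a0 + a1*x + a2*y and v = a3 + a4*x + a5*y."""
    design = np.column_stack([np.ones_like(x), x, y])
    (a0, a1, a2), *_ = np.linalg.lstsq(design, u, rcond=None)
    (a3, a4, a5), *_ = np.linalg.lstsq(design, v, rcond=None)
    return a0, a1, a2, a3, a4, a5

def motion_descriptors(params):
    """Translation, divergence, curl, and deformation of the affine flow."""
    a0, a1, a2, a3, a4, a5 = params
    return {
        "horizontal_translation": a0,
        "vertical_translation": a3,
        "divergence": a1 + a5,     # isotropic expansion or contraction
        "curl": a4 - a2,           # rotation in the image plane
        "deformation": a1 - a5,    # stretching along one axis
    }
```

For a region centered at the origin, a pure expansion u = 0.1*x, v = 0.1*y yields a divergence of 0.2 and zero curl.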

Feature-based facial expression data extraction using image sequences



Cohn et al. (the only surveyed method)


Feature points in the first frame manually marked with a
mouse around facial landmarks


A 13x13 flow window is centered around each point


Hierarchical optical flow method of Lucas and Kanade used
to track feature points in the image sequence


Displacement of each point calculated relative to the first
frame


The displacement of feature points between the initial and
peak frames used for classification
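The tracking step above can be illustrated with a single Lucas-Kanade estimate over a 13x13 window. This is a deliberately reduced sketch: it uses one resolution level and one iteration, whereas the method described above is hierarchical, and the function names are hypothetical.

```python
import numpy as np

def lucas_kanade_step(frame0, frame1, point, half_window=6):
    """One displacement estimate (dx, dy) for the window centered on point=(row, col).

    half_window=6 gives a 13x13 window. Only valid for small motions."""
    f0 = frame0.astype(float)
    f1 = frame1.astype(float)
    grad_y, grad_x = np.gradient(f0)          # derivatives along rows and columns
    r, c = point
    win = (slice(r - half_window, r + half_window + 1),
           slice(c - half_window, c + half_window + 1))
    ix = grad_x[win].ravel()
    iy = grad_y[win].ravel()
    it = (f1 - f0)[win].ravel()               # temporal derivative
    # Solve ix*dx + iy*dy = -it over the window in the least-squares sense.
    a = np.column_stack([ix, iy])
    flow, *_ = np.linalg.lstsq(a, -it, rcond=None)
    return float(flow[0]), float(flow[1])

def displacement_from_first(track):
    """track: (n_frames, n_points, 2) positions. Displacement relative to frame 0."""
    track = np.asarray(track, dtype=float)
    return track - track[0]
```

The last row of `displacement_from_first(track)`, taken at the peak frame, corresponds to the displacement vector passed on to classification.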

Classification


Two basic problems:


Defining a set of categories/classes


Choosing a classification mechanism


People are not very good at it either


In one study, a trained observer could classify only 87% of the faces
correctly


Expressions can be classified either in terms of the facial actions that
cause an expression or in terms of “typical” emotions


Facial muscle activity can be described by a set of codes


The codes are called Action Units (AUs). All possible, visually detectable
facial changes can be described by a set of 44 AUs. These codes form the
basis of the Facial Action Coding System (FACS), which provides a linguistic
description for each code.

Classification (continued)


Most of the studies perform an
emotion classification and use the
following 6 basic categories:
happiness, sadness, surprise, fear,
anger, and disgust


No agreement among psychologists
on whether these are the right
categories


People rarely produce “pure”
expressions (e.g. 100% happiness);
blends are much more common


Template-based classification using static images


Edwards et al.


The Mahalanobis distance measure can be used for classification:

$$d_i = (c - \bar{c}_i)^T C^{-1} (c - \bar{c}_i)$$

where $c$ is the vector of appearance parameters for the new image,
$\bar{c}_i$ is the centroid of the multivariate distribution for class $i$, and
$C^{-1}$ is the inverse of the within-class covariance matrix for all the training images


Classification into 6 basic + neutral categories


Correct recognition of 74% reported
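A nearest-centroid classifier under the Mahalanobis distance can be sketched as follows. The pooled within-class covariance estimate and the function names are assumptions made for the example; the surveyed paper's exact estimation procedure is not reproduced here.

```python
import numpy as np

def fit_classes(params, labels):
    """Return per-class centroids and the inverse pooled within-class covariance."""
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    centroids = {k: params[labels == k].mean(axis=0) for k in classes}
    # Center every sample on its own class centroid before pooling.
    centered = np.vstack([params[labels == k] - centroids[k] for k in classes])
    cov = centered.T @ centered / (len(params) - len(classes))
    return centroids, np.linalg.inv(cov)

def mahalanobis_sq(c, centroid, cov_inv):
    """Squared Mahalanobis distance between c and a class centroid."""
    d = c - centroid
    return float(d @ cov_inv @ d)

def classify(c, centroids, cov_inv):
    """Label of the class whose centroid is closest to c."""
    return min(centroids, key=lambda k: mahalanobis_sq(c, centroids[k], cov_inv))
```

Here `params` would hold one appearance-parameter vector per training image and `labels` the corresponding expression category.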

Neural network-based classification using static images


H. Kobayashi, F. Hara


Used a 234x50x6 neural network trained off-line using
backpropagation


The input layer units correspond to intensity values extracted
from the input image along the 13 vertical lines


The output units correspond to the 6 basic emotion
categories


Average correct recognition rate 85%

Neural network-based classification using static images (Continued)


Zhang et al.


Used a 680x7x7 neural network


Output units represent six basic emotion categories plus the
neutral category


Output units give a probability of the analyzed expression
belonging to the corresponding emotion category


Cross-validation used for testing


J. Zhao, G. Kearney


Used a 10x10x3 neural network


Neural network trained and tested on the whole set of data
with a 100% recognition rate
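A recognition rate measured on the same data used for training says little about performance on new faces. The sketch below illustrates this with a 1-nearest-neighbour classifier on random data with random labels; it is a stand-in chosen for brevity, not the neural network from the study, and all names and sizes are assumptions.

```python
import numpy as np

def nearest_neighbour_predict(train_x, train_y, query_x):
    """Label each query with the label of its closest training sample."""
    dists = np.linalg.norm(query_x[:, None, :] - train_x[None, :, :], axis=2)
    return train_y[dists.argmin(axis=1)]

def accuracy(pred, truth):
    return float(np.mean(pred == truth))

rng = np.random.default_rng(0)
x = rng.normal(size=(120, 10))          # random "features"
y = rng.integers(0, 3, size=120)        # random labels: nothing to learn
train_x, train_y = x[:60], y[:60]
test_x, test_y = x[60:], y[60:]

# Evaluated on its own training data, every sample matches itself.
train_acc = accuracy(nearest_neighbour_predict(train_x, train_y, train_x), train_y)
# Evaluated on held-out data, the same classifier does no better than guessing.
test_acc = accuracy(nearest_neighbour_predict(train_x, train_y, test_x), test_y)
print(train_acc, test_acc)
```

The training-set score is perfect even though the labels carry no information, which is why held-out or cross-validated figures are the ones worth comparing.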


Rule-based classification using static images


M. Pantic, L. Rothkrantz (the only surveyed method)


Two-stage classification:


1. Facial actions (corresponding to one of the Action Units) are deduced from
changes in face geometry


Action Units are described in terms of face model feature values (E.g. AU 28 = (Both)
lips sucked in = feature 17 is 0, where feature 17 = Distance KL)


2. The stage 1 classification results are used to classify the expression into one of
the emotion categories


E.g. AU6 + AU12 + AU16 + AU25 => Happiness


The two-stage classification process allows “weighted emotion labels”


Assumption: each AU that is part of the AU-coded description of a “pure”
emotional expression has the same influence on the intensity of that emotional
expression


E.g. if the analysis of some image results in the activation of AU6, AU12, and
AU16, then the expression is classified as 75% happiness (three of the four AUs
in the happiness rule above are active)


The system can distinguish 29 AUs


Recognition rate 92% for upper face AUs, and 86% for lower face AUs
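The two stages and the equal-influence assumption above reduce to a few lines. The sketch uses only the two rules quoted on this slide (AU 28 from feature 17, and the happiness rule); the function names and the tolerance parameter are hypothetical.

```python
# AUs in the "pure" happiness rule quoted above.
HAPPINESS_AUS = frozenset({"AU6", "AU12", "AU16", "AU25"})

def detect_au28(feature_17, tolerance=0.0):
    """Stage 1 example: AU 28 (lips sucked in) is active when feature 17 is 0."""
    return abs(feature_17) <= tolerance

def emotion_weight(active_aus, prototype_aus):
    """Stage 2: share of the prototype's AUs that are active, each counted equally."""
    return len(set(active_aus) & prototype_aus) / len(prototype_aus)

weight = emotion_weight({"AU6", "AU12", "AU16"}, HAPPINESS_AUS)
print(f"{weight:.0%} happiness")
```

With three of the four happiness AUs active, the weight is 3/4, which matches the 75% figure in the example above.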


Template-based classification using image sequences


Cohn et al.


Classification in terms of Action Units


Uses Discriminant Function Analysis


Deals with each face region separately


Used for classification only (i.e. all facial point
displacements are used as input)


Does not deal with image sequences containing
several consecutive facial actions


Recognition rate: 92% in the brow region, 88% in
the eye region, 83% in the nose and mouth region


Rule-based classification using image sequences


M. Black, Y. Yacoob (the only surveyed method)


Mid- and high-level descriptions of facial actions are used


The parameter values (e.g. translation, divergence) derived from
optical flow are thresholded


E.g. Div > 0.02 => expansion, Div < -0.02 => contraction. This is what
the authors would call a mid-level predicate for the mouth.


High-level predicates are rules for classifying facial expressions


Rules for detecting the beginning and the end of an expression


Use the results of applying mid-level rules as input


E.g. Beginning of surprise = Raising brows and vertical expansion of
mouth, End of Surprise = Lowering brows and vertical contraction of
mouth


The rules used for classification are not designed to deal with
blends of emotional expressions (Anger + Fear recognized as
disgust)


Recognition rate: 88%
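The mid-level and high-level rules quoted above can be written down directly. Only the divergence threshold of 0.02 and the surprise rule come from the slide; the predicate strings and function names are hypothetical.

```python
def mouth_divergence(div, threshold=0.02):
    """Mid-level predicate: threshold the divergence parameter of the mouth region."""
    if div > threshold:
        return "expansion"
    if div < -threshold:
        return "contraction"
    return None                      # no significant change

def surprise_boundary(brow_motion, mouth_motion):
    """High-level rule: detect the beginning or the end of a surprise expression.

    brow_motion is 'raising', 'lowering', or None; mouth_motion is the
    vertical mouth predicate ('expansion', 'contraction', or None)."""
    if brow_motion == "raising" and mouth_motion == "expansion":
        return "beginning"
    if brow_motion == "lowering" and mouth_motion == "contraction":
        return "end"
    return None
```

Because each rule fires on a fixed combination of predicates, a blend of two emotions has no rule of its own, which is consistent with the limitation noted above.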

Conclusions and Possible Directions
for Future Research


Active research area


Most surveyed systems rely on the frontal view of the face and
assume no facial hair or glasses


None of the surveyed systems can distinguish all 44 AUs defined
in FACS


Classification into basic emotion categories in most surveyed
studies


Some reported results are of little practical value


The ability of the human visual system to “fill in” missing parts
of the observed face (i.e. deal with partial occlusions) has not
been investigated

Conclusions and Possible Directions
for Future Research (Continued)



Not clear at all whether the 6 “basic” emotion categories are
universal


Each person has his/her own range of expression intensity, so
systems that start with a generic classification and then adapt
may be of interest


Assignment of a higher priority to upper face features by the
human visual system (when interpreting facial expressions) has
not been the subject of much research


Hard or impossible to compare reported results objectively
without a well-defined, commonly used database of face images

References


M. Pantic, L. Rothkrantz, “Automatic Analysis of Facial Expressions: The State of the
Art”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No.
12, December 2000


M. Pantic, L. Rothkrantz, “Expert System for Automatic Analysis of Facial
Expressions”, Image and Vision Computing, Vol. 18, No. 11, pp. 881-905, 2000


M. J. Black, Y. Yacoob, “Recognizing Facial Expressions in Image Sequences Using
Local Parameterized Models of Image Motion”, Int’l J. Computer Vision, Vol. 25, No. 1,
pp. 23-48, 1997


J. F. Cohn, A. J. Zlochower, J. J. Lien, T. Kanade, “Feature-Point Tracking by Optical
Flow Discriminates Subtle Differences in Facial Expression”, Proc. Int’l Conf.
Automatic Face and Gesture Recognition, pp. 396-401, 1998


G. J. Edwards, T. F. Cootes, C. J. Taylor, “Face Recognition Using Active Appearance
Models”, Proc. European Conf. Computer Vision, Vol. 2, pp. 581-595, 1998



T. F. Cootes, G. J. Edwards, C. J. Taylor, “Active Appearance Models”, Proc. European
Conf. Computer Vision, Vol. 2, pp. 484-498, 1998


H. Kobayashi, F. Hara, “Facial Interaction between Animated 3D Face Robot and
Human Beings”, Proc. Int’l Conf. Systems, Man, Cybernetics, pp. 3,732-3,737, 1997

Some YouTube Videos


Real-time facial expression recognition


Take 2


Facial expression recognition


Facial expression mirroring


Facial expression animation