Graduate Institute of Communication Engineering
College of Electrical Engineering and Computer Science
National Taiwan University

Tutorial

Automatic Facial Expression Recognition System

Jung-Zuo Liu

Advisor: Jian-Jiun Ding, Ph.D.

June, 2011
ABSTRACT
Human beings can detect facial expressions without any effort, while it is difficult for a computer to do this job. The development of an automated system that analyzes facial expressions accurately therefore remains an open problem. There are several related subproblems: detection of an image segment as a face, extraction of the facial expression information, and classification of the expression. A system that performs these operations accurately and in real time would be a big step toward human-like interaction between man and machine. In this tutorial we illustrate some techniques that can improve the performance of automatic facial expression analysis.
INTRODUCTION
Facial expressions play an essential role in human communication. In face-to-face human communication only 7% of the communicative message is due to linguistic language, 38% is due to paralanguage, while 55% of it is transferred by facial expressions.
In this tutorial, we illustrate the techniques used in the facial expression analysis system we designed. The system consists of three parts, namely, feature extraction, dimension reduction, and facial expression classification. When an image containing a specific facial expression enters our system, we first extract the important features that carry the information about the expression. Because these features may be very high-dimensional vectors, which makes it difficult to apply an excellent classifier directly, we then reduce the dimension of the features. After the dimension reduction, the lower-dimensional features are fed into the last part, the facial expression classifier, which produces the final output of our system.

In the following chapters, we introduce some key techniques in each part respectively.
Chapter 1 illustrates the main features we use to represent the image, Chapter 2 explains some principal dimension reduction techniques which are widely used, and Chapter 3 introduces the concept of the support vector machine, which may be the most important and useful classifier.
Figure 1: The framework of our facial expression analysis system
CHAPTER 1: FEATURE EXTRACTION
1.1 Multistate Feature-Based Action Unit Recognition
In this section, we describe our multistate feature-based action unit (AU) recognition system, which explicitly analyzes appearance changes in localized facial features in a nearly frontal image sequence. Since each AU is associated with a specific set of facial muscles, we believe that accurate geometrical modeling and tracking of facial features will lead to better recognition results.
Figure 2 depicts the overall structure of the Automatic Face Analysis (AFA) system. Given an image sequence, the region of the face and the approximate location of individual face features are detected automatically in the initial frame. Both permanent (e.g., brows, eyes, lips) and transient (lines and furrows) face feature changes are automatically detected and tracked in the image sequence. Informed by Facial Action Coding System (FACS) AUs, we group the facial features into separate collections of feature parameters because the facial actions in the upper and lower face are relatively independent for AU recognition. In the upper face, 15 parameters describe shape, motion, eye state, motion of brow and cheek, and furrows. In the lower face, nine parameters describe shape, motion, lip state, and furrows. These parameters are geometrically normalized to compensate for image scale and in-plane head motion.
Figure 2: Feature-based automatic facial action analysis system (AFA)
1.2 Multistate Face Component Models
To detect and track changes of facial components in near-frontal images, we develop multistate facial component models. The models are illustrated in Table 2, which includes both permanent and transient components. A three-state lip model describes the lip state: open, closed, and tightly closed. A two-state model (open or closed) is used for each of the eyes. Each brow and cheek has a one-state model. Transient facial features, such as nasolabial furrows, have two states: present and absent.
Table 1: Upper Face Action Units and Some Combinations
Table 2: Multistate Facial Component Models of a Frontal Face
CHAPTER 2: DIMENSION REDUCTION
2.1 The PCA Space
The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.
Therefore, we can represent an image as a vector in a t-dimensional space and use principal component analysis to find a subspace whose basis vectors correspond to the maximum-variance directions in the original space. Let W represent the linear transformation that maps the original t-dimensional space onto an f-dimensional feature subspace, where f ≪ t. The new feature vectors y_i are defined by

y_i = W^T x_i,   i = 1, ..., N

The columns of W are the eigenvectors e_i obtained by solving the eigenstructure decomposition λ_i e_i = Q e_i, where Q = XX^T is the covariance matrix and λ_i is the eigenvalue associated with the eigenvector e_i. Before obtaining the eigenvectors of Q, we should take the following two steps: (1) the vectors x_i are normalized to unit length to make the system invariant to the intensity of the illumination source; (2) the average of all images is subtracted from all normalized vectors to ensure that the eigenvector with the highest eigenvalue represents the dimension in the eigenspace in which the variance of the vectors is maximum in a correlation sense.
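The two preprocessing steps and the eigendecomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own conventions (samples stored as rows, so the covariance is computed as X^T X), not the code of the actual system:

```python
import numpy as np

def pca_project(X, f):
    """Sketch of the PCA step above: rows of X are t-dimensional samples."""
    # (1) normalize each sample to unit length (illumination invariance)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # (2) subtract the mean of all normalized samples
    X = X - X.mean(axis=0)
    # covariance matrix Q (t x t); with samples as rows this is X^T X
    Q = X.T @ X
    eigvals, eigvecs = np.linalg.eigh(Q)   # eigh returns ascending eigenvalues
    W = eigvecs[:, ::-1][:, :f]            # keep the f leading eigenvectors
    return X @ W                           # y_i = W^T x_i for every sample

rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 10))       # N = 100 samples, t = 10
reduced = pca_project(samples, f=3)
print(reduced.shape)                       # (100, 3)
```

Because the columns of W are eigenvectors of Q, the projected components are uncorrelated, as the PCA definition above requires.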
Figure 3: Gaussian samples
Figure 4: Gaussian samples with eigenvectors of the sample covariance matrix
Figure 5: PC projected samples
Figure 6: PC dimensionality reduction step
2.2 The LDA Space
Linear discriminant analysis (LDA) searches for those vectors in the underlying space that best discriminate among classes (rather than those that best describe the data). More formally, given a number of independent features relative to which the data are described, LDA creates a linear combination of these which yields the largest mean differences between the desired classes. Mathematically speaking, for all the samples of all classes, we define two measures: (1) the first is called the within-class scatter matrix, given by

S_w = Σ_{j=1}^{c} Σ_{i=1}^{N_j} (x_i^j − μ_j)(x_i^j − μ_j)^T    (Eq. 1)

where x_i^j is the i-th sample of class j, μ_j is the mean of class j, c is the number of classes, and N_j is the number of samples in class j; and (2) the other is called the between-class scatter matrix,

S_b = Σ_{j=1}^{c} (μ_j − μ)(μ_j − μ)^T    (Eq. 2)

where μ represents the mean of all classes.

The goal is to maximize the between-class measure while minimizing the within-class measure. One way to do this is to maximize the ratio det(S_b)/det(S_w). The advantage of using this ratio is that it has been proven that if S_w is a nonsingular matrix, then the ratio is maximized when the column vectors of the projection matrix W are the eigenvectors of S_w^{-1} S_b. It should be noted that: (1) there are at most c − 1 nonzero generalized eigenvectors, so an upper bound on f is c − 1; and (2) we require at least t + c samples to guarantee that S_w does not become singular (which is almost impossible in any realistic application). To solve this, the use of an intermediate space has been proposed; this intermediate space is chosen to be the PCA space. Thus, the original t-dimensional space is projected onto an intermediate g-dimensional space using PCA and then onto a final f-dimensional space using LDA.
Now let us develop LDA case by case.

Case 1: Project the data points from two different classes onto a one-dimensional space.

The goal we want to achieve is to find a vector w and project the data points onto w, so that we get a new coordinate y:

y = w^T x    (Eq. 3)
The concept of LDA is to project the data points of the same class as close together as possible on the new coordinate, while keeping the data from different classes as far apart as possible. To describe this concept, we need some quantitative variables; the first is the mean of each class:

m_i = (1/n_i) Σ_{x ∈ D_i} x    (Eq. 4)

The mean of each class after projection is:

m̃_i = (1/n_i) Σ_{y ∈ Y_i} y = (1/n_i) Σ_{x ∈ D_i} w^T x = w^T m_i    (Eq. 5)

where n_i is the number of data points of the i-th class, D_i is the collection of the i-th class data, and Y_i is the collection of the projected i-th class data.
Furthermore, we can see that the mean of each class after projection is the projection of the mean of the original data in the high-dimensional space.

Next we define the distance between the two classes after projection:

|m̃_1 − m̃_2| = |w^T (m_1 − m_2)|    (Eq. 6)

We also define the scatter of each class after projection:

s̃_i² = Σ_{y ∈ Y_i} (y − m̃_i)²    (Eq. 7)

Based on the concept of LDA, the farther the projected data of the two classes are separated from each other, the larger the mean difference; and the more concentrated the projected data of the same class, the smaller the divergence within each class. We can represent the concept mathematically as below:

J(w) = |m̃_1 − m̃_2|² / (s̃_1² + s̃_2²)    (Eq. 8)
One way to maximize this ratio is to express it in matrix form. Define the scatter matrix of each class:

S_i = Σ_{x ∈ D_i} (x − m_i)(x − m_i)^T    (Eq. 9)

Then

s̃_i² = Σ_{x ∈ D_i} (w^T x − w^T m_i)² = Σ_{x ∈ D_i} w^T (x − m_i)(x − m_i)^T w = w^T S_i w    (Eq. 10)

Let S_W = S_1 + S_2; then

s̃_1² + s̃_2² = w^T (S_1 + S_2) w = w^T S_W w    (Eq. 11)

(m̃_1 − m̃_2)² = (w^T m_1 − w^T m_2)² = w^T (m_1 − m_2)(m_1 − m_2)^T w = w^T S_B w    (Eq. 12)

where S_B = (m_1 − m_2)(m_1 − m_2)^T. So we get

J(w) = (w^T S_B w) / (w^T S_W w)

The w that maximizes J(w) satisfies the following equation:

S_B w = λ S_W w    (Eq. 13)
This is a generalized eigenvalue problem. When S_W is invertible, the equation above reduces to a simple eigenvalue problem:

S_W^{-1} S_B w = λ w    (Eq. 14)

But since S_B w always points in the direction of (m_1 − m_2), the solution is given exactly by the equation below, and we do not need to solve the eigenvalue problem:

w = S_W^{-1} (m_1 − m_2)    (Eq. 15)
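Eq. 15 can be sketched directly in NumPy. The two toy Gaussian classes below are made up for illustration; the point is only that the closed-form direction separates the projected class means:

```python
import numpy as np

def fisher_direction(X1, X2):
    """w = S_W^{-1}(m1 - m2) from Eq. 15; rows of X1, X2 are samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)           # scatter of class 1 (Eq. 9)
    S2 = (X2 - m2).T @ (X2 - m2)           # scatter of class 2
    return np.linalg.solve(S1 + S2, m1 - m2)

rng = np.random.default_rng(1)
A = rng.normal(loc=[0.0, 0.0], size=(50, 2))   # class 1 around the origin
B = rng.normal(loc=[3.0, 3.0], size=(50, 2))   # class 2 shifted away
w = fisher_direction(A, B)
# the projected class means separate along w (Eq. 6)
print((A @ w).mean() > (B @ w).mean())         # True
```

Using `np.linalg.solve` instead of explicitly inverting S_W is the standard numerically safer way to apply S_W^{-1}.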
Case 2: Project multiple-class data onto a higher-dimensional space.

Now we make some changes to handle multiple classes and a higher-dimensional projection space. First, we convert the within-class scatter into the version for multiple classes (class number > 2):

S_W = Σ_{i=1}^{c} S_i    (Eq. 16)

Then convert S_B into the following equation:

S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T    (Eq. 17)

Note the difference from the S_B of Case 1. Generally speaking, S_B still describes the divergence between the classes.

When projecting data onto a higher-dimensional space, we no longer seek only one vector w, but a set of basis vectors. Consequently, we collect the vectors w into a matrix W, each column being a basis vector. The numerator and denominator of the original criterion become:

S̃_B = W^T S_B W,   S̃_W = W^T S_W W    (Eq. 18)

The J(w) of Case 1 becomes:

J(W) = |W^T S_B W| / |W^T S_W W|    (Eq. 19)

Note that the W in J(W) is a matrix which represents a set of basis vectors.
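The multiclass case (Eqs. 16, 17, and 14) can likewise be sketched as follows. The three Gaussian classes are made up for illustration, and taking the real parts of the eigenvectors is a numerical convenience, since S_W^{-1} S_B is not symmetric:

```python
import numpy as np

def lda_basis(classes, f):
    """Leading eigenvectors of S_W^{-1} S_B (Eqs. 16, 17, 14); f <= c - 1."""
    m = np.vstack(classes).mean(axis=0)            # mean of all samples
    t = m.size
    Sw, Sb = np.zeros((t, t)), np.zeros((t, t))
    for Xi in classes:
        mi = Xi.mean(axis=0)
        Sw += (Xi - mi).T @ (Xi - mi)              # Eq. 16
        d = (mi - m)[:, None]
        Sb += len(Xi) * (d @ d.T)                  # Eq. 17
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]            # largest eigenvalues first
    return vecs.real[:, order[:f]]

rng = np.random.default_rng(2)
classes = [rng.normal(loc=c, size=(30, 3))
           for c in ([0, 0, 0], [4, 0, 0], [0, 4, 0])]
W = lda_basis(classes, f=2)                        # c = 3, so at most 2 axes
print(W.shape)                                     # (3, 2)
```

Note that f = 2 here is exactly the c − 1 upper bound discussed in Section 2.2.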
As shown in Figure 7, the basis derived by PCA generates the minimum reconstruction error (in the Euclidean distance sense). On the other hand, the basis derived by LDA is very different from that of PCA. From the figure we can see that the data points projected onto the LDA basis are separated into two classes, while PCA cannot separate them.

Figure 7: The difference between PCA and LDA
CHAPTER 3: FACIAL EXPRESSION CLASSIFIER
3.1 Introduction to Machine Learning
Machine learning is programming computers to optimize a performance criterion using example data or past experience. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. The model may be predictive, to make predictions in the future, or descriptive, to gain knowledge from data, or both.
Generally speaking, there are two major types of learning with respect to the input data: supervised learning and unsupervised learning. In supervised learning, the aim is to learn a mapping from the input to an output whose correct values are provided by a supervisor. In unsupervised learning, there is no such supervisor and we only have input data. The aim of unsupervised learning is to find regularities in the input. There is a structure to the input space such that certain patterns occur more often than others, and we want to see what generally happens and what does not. Before we start to describe machine learning mathematically, we illustrate some notation.
We use x^(i) to denote the "input" variables, also called input features, and y^(i) to denote the "output" or target variable that we are trying to predict. A pair (x^(i), y^(i)) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set. We will also use X to denote the space of input values, and Y the space of output values. To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h: X → Y so that h(x) is a "good" predictor for the corresponding value of y. This function h(x) is called a hypothesis. The following picture illustrates the supervised learning process:
Figure 8: The process of supervised learning
3.2 Linear Regression
When the target variable that we're trying to predict is continuous, we call the learning problem a regression problem. When y can take on only a small number of discrete values, we call it a classification problem.

Figure 9: An example of linear regression
To perform supervised learning, we decide to approximate y as a linear function of x:

h(x) = θ_0 + θ_1 x_1 + θ_2 x_2    (Eq. 20)

where the θ_i's are the parameters parameterizing the space of linear functions mapping from X to Y. To simplify our notation, we also introduce the convention of letting x_0 = 1 (this is called the intercept term), so that

h(x) = Σ_{i=0}^{n} θ_i x_i = θ^T x    (Eq. 21)
Now, given a training set, how do we learn the parameters θ? One reasonable method is to make h(x) close to y. We define the cost function:

J(θ) = (1/2) Σ_{i=1}^{m} (h(x^(i)) − y^(i))²    (Eq. 22)
We want to choose θ so as to minimize J(θ). To do so, let us use a search algorithm that starts with some "initial guess" for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ). Specifically, let us consider the gradient descent algorithm, which starts with some initial θ and repeatedly performs the update:

θ_j := θ_j − α ∂J(θ)/∂θ_j    (Eq. 23)

Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J.
In order to implement this algorithm, we have to work out the partial derivative term on the right-hand side. Let's first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. We have:

∂J(θ)/∂θ_j = ∂/∂θ_j [ (1/2)(h(x) − y)² ]
           = (h(x) − y) · ∂/∂θ_j (Σ_{i=0}^{n} θ_i x_i − y)
           = (h(x) − y) x_j    (Eq. 24)

For a single training example, this gives the update rule:

θ_j := θ_j + α (y^(i) − h(x^(i))) x_j^(i)    (Eq. 25)
This rule is called the LMS update rule. It has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term (y^(i) − h(x^(i))).

We derived the LMS rule for when there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm:

Repeat until convergence {
    θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) − h(x^(i))) x_j^(i)   (for every j)    (Eq. 26)
}
We can easily verify that the quantity in the summation in the update rule above is just ∂J(θ)/∂θ_j (for the original definition of J). So, this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum, and no other local optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.
Figure 10: An example of gradient descent
The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent. The x's in the figure mark the successive values of θ that gradient descent went through.
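Batch gradient descent (Eq. 26) on a toy linear dataset might look like the sketch below. The learning rate, iteration count, and the averaging of the error over the m examples are our own choices for this illustration:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.5, iters=2000):
    """Repeat the update of Eq. 26 (error averaged over the m examples)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        error = y - X @ theta                # y^(i) - h(x^(i)) for all i
        theta = theta + alpha * X.T @ error / len(y)
    return theta

x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])    # intercept term x_0 = 1
y = 1.0 + 2.0 * x                            # exactly linear target
theta = batch_gradient_descent(X, y)
print(np.round(theta, 3))                    # close to [1. 2.]
```

Because J is convex quadratic, the iterates converge to the unique global minimum, here the true parameters (θ_0, θ_1) = (1, 2).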
Consider the problem of predicting y from x. The leftmost figure below shows the result of fitting a line y = θ_0 + θ_1 x_1 to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good. Instead, if we had added an extra feature x_1², and fit y = θ_0 + θ_1 x_1 + θ_2 x_1², then we would obtain a slightly better fit to the data (see the middle figure). Naively, it might seem that the more features we add, the better. However, there is also a danger in adding too many features. The rightmost figure is the result of fitting a 5th-order polynomial y = Σ_{j=0}^{5} θ_j x^j. We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor. Without formally defining what these terms mean, we'll say the figure on the left shows an instance of underfitting, in which the data clearly shows structure not captured by the model, and the figure on the right is an example of overfitting.
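The underfitting/overfitting behavior described above can be reproduced numerically. The quadratic data and noise level below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 6)                     # six training points
y = 1 + 2 * x - 3 * x**2 + rng.normal(scale=0.1, size=x.size)

residuals = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
    residuals[degree] = float(np.max(np.abs(np.polyval(coeffs, x) - y)))
    print(degree, round(residuals[degree], 4))
# the 5th-order polynomial (6 coefficients for 6 points) interpolates the
# training data exactly, yet would predict poorly between the samples
```

The training residual shrinks monotonically with the degree, which is precisely why training error alone cannot detect overfitting.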
Figure 11: Different situations of training-data fitting
Figure 12: There are many ways to separate the data from different classes
3.3 Introduction to Support Vector Machines
We'll start our introduction to SVMs by talking about margins. This section will give the intuitions about margins and about the "confidence" of our predictions. Consider logistic regression, where the probability p(y = 1 | x; θ) is modeled by h(x) = g(θ^T x). We would then predict "1" on an input x if and only if h(x) ≥ 0.5, or equivalently, if and only if θ^T x ≥ 0. Consider a positive training example (y = 1).
The larger θ^T x is, the larger also is h(x) = p(y = 1 | x; θ), and thus also the higher our degree of "confidence" that the label is 1. Thus, informally we can think of our prediction as being a very confident one that y = 1 if θ^T x ≫ 0. Similarly, we think of logistic regression as making a very confident prediction of y = 0 if θ^T x ≪ 0. Given a training set, again informally it seems that we'd have found a good fit to the training data if we can find θ so that θ^T x^(i) ≫ 0 whenever y^(i) = 1, and θ^T x^(i) ≪ 0 whenever y^(i) = 0, since this would reflect a very confident (and correct) set of classifications for all the training examples. This seems to be a nice goal to aim for, and we'll soon formalize this idea using the notion of functional margins.
For a different type of intuition, consider the following figure, in which x's represent positive training examples, o's denote negative training examples, a decision boundary (this is the line given by the equation θ^T x = 0, and is also called the separating hyperplane) is also shown, and three points have also been labeled A, B, and C.

Figure 13: The confidence of predictions at different locations relative to the decision boundary
Note that the point A is very far from the decision boundary. If we are asked to make a prediction for the value of y at A, it seems we should be quite confident that y = 1 there. Conversely, the point C is very close to the decision boundary, and while it's on the side of the decision boundary on which we would predict y = 1, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be y = 0. Hence, we're much more confident about our prediction at A than at C. The point B lies in between these two cases, and more broadly, we see that if a point is far from the separating hyperplane, then we may be significantly more confident in our predictions. Again, informally we think it'd be nice if, given a training set, we manage to find a decision boundary that allows us to make all correct and confident (meaning far from the decision boundary) predictions on the training examples.
To make our discussion of SVMs easier, we'll first need to introduce a new notation for talking about classification. We will be considering a linear classifier for a binary classification problem with labels y and features x. From now on, we'll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels. Also, rather than parameterizing our linear classifier with the vector θ, we will use parameters w, b, and write our classifier as

h_{w,b}(x) = g(w^T x + b)    (Eq. 27)

Here, g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise. This "(w, b)" notation allows us to explicitly treat the intercept term b separately from the other parameters. (We also drop the convention we had previously of letting x_0 = 1 be an extra coordinate in the input feature vector.) Thus, b takes the role of what was previously θ_0, and w takes the role of [θ_1, ..., θ_n]^T.
Note also that, from our definition of g above, our classifier will directly predict either 1 or −1, without first going through the intermediate step of estimating the probability of y being 1.

Let's formalize the notions of the functional and geometric margins. Given a training example (x^(i), y^(i)), we define the functional margin of (w, b) with respect to the training example as

γ̂^(i) = y^(i) (w^T x^(i) + b)    (Eq. 28)
Note that if y^(i) = 1, then for the functional margin to be large, we need w^T x^(i) + b to be a large positive number. Conversely, if y^(i) = −1, then for the functional margin to be large, we need w^T x^(i) + b to be a large negative number. Moreover, if y^(i)(w^T x^(i) + b) > 0, then our prediction on this example is correct. Hence, a large functional margin represents a confident and correct prediction.
For a linear classifier with the choice of g given above (taking values in {−1, 1}), there's one property of the functional margin that makes it not a very good measure of confidence, however. Given our choice of g, we note that if we replace w with 2w and b with 2b, then since g(w^T x + b) = g(2w^T x + 2b), this would not change h_{w,b}(x) at all. g, and hence also h_{w,b}(x), depends only on the sign, but not on the magnitude, of w^T x + b. However, replacing (w, b) with (2w, 2b) also results in multiplying our functional margin by a factor of 2. Thus, it seems that by exploiting our freedom to scale w and b, we can make the functional margin arbitrarily large without really changing anything meaningful. Intuitively, it might therefore make sense to impose some sort of normalization condition such as ‖w‖_2 = 1. That is, we might replace (w, b) with (w/‖w‖_2, b/‖w‖_2), and instead consider the functional margin of (w/‖w‖_2, b/‖w‖_2).
Given a training set S = {(x^(i), y^(i)); i = 1, ..., m}, we also define the functional margin of (w, b) with respect to S as the smallest of the functional margins of the individual training examples. Denoted by γ̂, this can therefore be written:

γ̂ = min_{i=1,...,m} γ̂^(i)    (Eq. 29)
Next, let's talk about geometric margins. Consider the picture below:

Figure 14: Illustration of the margins
The decision boundary corresponding to (w, b) is shown, along with the vector w. Note that w is orthogonal to the separating hyperplane. Consider the point A, which represents the input x^(i) of some training example with label y^(i) = 1. Its distance to the decision boundary, γ^(i), is given by the line segment AB.
How can we find the value of γ^(i)? Well, w/‖w‖ is a unit-length vector pointing in the same direction as w. Since A represents x^(i), we therefore find that the point B is given by x^(i) − γ^(i) · w/‖w‖. But this point lies on the decision boundary, and all points x on the decision boundary satisfy the equation w^T x + b = 0. Hence,

w^T (x^(i) − γ^(i) w/‖w‖) + b = 0    (Eq. 30)

Solving for γ^(i) yields

γ^(i) = (w^T x^(i) + b)/‖w‖ = (w/‖w‖)^T x^(i) + b/‖w‖
This was worked out for the case of a positive training example at A in the figure, where being on the "positive" side of the decision boundary is good. More generally, we define the geometric margin of (w, b) with respect to a training example (x^(i), y^(i)) to be

γ^(i) = y^(i) ((w/‖w‖)^T x^(i) + b/‖w‖)    (Eq. 31)
Note that if ‖w‖ = 1, then the functional margin equals the geometric margin; this thus gives us a way of relating these two different notions of margin. Also, the geometric margin is invariant to rescaling of the parameters. That is, if we replace w with 2w and b with 2b, then the geometric margin does not change. This will in fact come in handy later. Specifically, because of this invariance to the scaling of the parameters, when trying to fit w and b to training data, we can impose an arbitrary scaling constraint on w without changing anything important. For instance, we can demand that ‖w‖ = 1, or |w_1 + b| + |w_2| = 2, and any of these can be satisfied simply by rescaling w and b.
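The contrasting scaling behavior of the two margins (Eqs. 28/29 versus Eqs. 31/32) is easy to verify numerically. The three labeled points below are made up for illustration:

```python
import numpy as np

def set_margins(w, b, X, y):
    """Smallest functional (Eq. 29) and geometric (Eq. 32) margins over a set."""
    functional = y * (X @ w + b)               # Eq. 28 for every example
    geometric = functional / np.linalg.norm(w) # Eq. 31 for every example
    return functional.min(), geometric.min()

X = np.array([[2.0, 2.0], [-1.0, -1.0], [3.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
w, b = np.array([1.0, 1.0]), -1.0

f1, g1 = set_margins(w, b, X, y)
f2, g2 = set_margins(2 * w, 2 * b, X, y)   # rescale (w, b) -> (2w, 2b)
print(f1, f2)                              # functional margin doubles: 3.0 6.0
print(bool(np.isclose(g1, g2)))            # geometric margin unchanged: True
```

This is exactly the invariance that lets us later fix the functional margin to 1 without loss of generality.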
Finally, given a training set S = {(x^(i), y^(i)); i = 1, ..., m}, we also define the geometric margin of (w, b) with respect to S to be the smallest of the geometric margins on the individual training examples:

γ = min_{i=1,...,m} γ^(i)    (Eq. 32)
Given a training set, it seems from our previous discussion that a natural desideratum is to try to find a decision boundary that maximizes the (geometric) margin, since this would reflect a very confident set of predictions on the training set and a good "fit" to the training data. Specifically, this will result in a classifier that separates the positive and the negative training examples with a "gap". For now, we will assume that we are given a training set that is linearly separable.
That is, it is possible to separate the positive and negative examples using some separating hyperplane. How do we find the one that achieves the maximum geometric margin? We can pose the following optimization problem:

max_{γ,w,b}  γ
s.t.  y^(i)(w^T x^(i) + b) ≥ γ,  i = 1, ..., m
      ‖w‖ = 1    (Eq. 33)

We want to maximize γ, subject to each training example having functional margin at least γ. The ‖w‖ = 1 constraint moreover ensures that the functional margin equals the geometric margin, so we are also guaranteed that all the geometric margins are at least γ. Thus, solving this problem will result in (w, b) with the largest possible geometric margin with respect to the training set.
If we could solve the optimization problem above, we'd be done. But the "‖w‖ = 1" constraint is a nasty (non-convex) one, and this problem certainly isn't in any format that we can plug into standard optimization software to solve. So, let's try transforming the problem into a nicer one. Consider:

max_{γ̂,w,b}  γ̂/‖w‖
s.t.  y^(i)(w^T x^(i) + b) ≥ γ̂,  i = 1, ..., m    (Eq. 34)

Here, we're going to maximize γ̂/‖w‖, subject to the functional margins all being at least γ̂. Since the geometric and functional margins are related by γ = γ̂/‖w‖, this will give us the answer we want. Moreover, we've gotten rid of the constraint ‖w‖ = 1 that we didn't like. The downside is that we now have a nasty (non-convex) objective γ̂/‖w‖, and we still don't have any off-the-shelf software that can solve this form of an optimization problem.
Let's keep going. Recall our earlier discussion that we can add an arbitrary scaling constraint on w and b without changing anything. This is the key idea we'll use now. We will introduce the scaling constraint that the functional margin of (w, b) with respect to the training set must be 1:

γ̂ = 1

Since multiplying w and b by some constant results in the functional margin being multiplied by that same constant, this is indeed a scaling constraint, and can be satisfied by rescaling w and b. Plugging this into our problem above, and noting that maximizing γ̂/‖w‖ = 1/‖w‖ is the same thing as minimizing ‖w‖², we now have the following optimization problem:

min_{w,b}  (1/2)‖w‖²
s.t.  y^(i)(w^T x^(i) + b) ≥ 1,  i = 1, ..., m    (Eq. 35)

We've now transformed the problem into a form that can be efficiently solved. The above is an optimization problem with a convex quadratic objective and only linear constraints. Its solution gives us the optimal margin classifier.
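As a sketch, the small problem of Eq. 35 can be handed to a general-purpose constrained solver; here we use SciPy's SLSQP on a made-up toy set. This is only an illustration of the optimization problem itself, not how practical SVM software works (real implementations solve the dual with specialized QP methods):

```python
import numpy as np
from scipy.optimize import minimize

# toy linearly separable training set (made up for illustration)
X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):                  # p = (w_1, w_2, b): minimize (1/2)||w||^2
    return 0.5 * (p[0] ** 2 + p[1] ** 2)

# one constraint y^(i)(w^T x^(i) + b) - 1 >= 0 per training example (Eq. 35)
constraints = [{"type": "ineq",
                "fun": (lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1.0)}
               for i in range(len(y))]

sol = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=constraints)
w, b = sol.x[:2], sol.x[2]
margins = y * (X @ w + b)
print(bool(np.all(margins >= 1.0 - 1e-6)))   # every functional margin >= 1
```

For this symmetric toy set the solver recovers w ≈ (0.5, 0), b ≈ 0, i.e. the vertical separating hyperplane, whose geometric margin 1/‖w‖ = 2 is indeed the largest possible.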
CONCLUSION
We have illustrated our facial expression analysis system and explained each of its parts, introducing the key techniques used in each. We now have an idea of what an AU is and understand the advantages of the AFA system. We also know the difference between PCA and LDA, and we have the basic concept of the support vector machine. In the future, we should try more types of features to seek the most appropriate one for representing the input image, and we should investigate the association between different classifiers and different features, so that we can find the best combination for facial expression analysis.
REFERENCES
[1] Facial Expression Coding Project, cooperation and competition between Carnegie Mellon Univ. and Univ. of California, San Diego, unpublished, 2000.
[2] T. Kanade, J. Cohn, and Y. Tian, "Comprehensive Database for Facial Expression Analysis," Proc. Int'l Conf. Face and Gesture Recognition, pp. 46-53, Mar. 2000.
[3] G. Donato, M.S. Bartlett, J.C. Hager, P. Ekman, and T.J. Sejnowski, "Classifying Facial Actions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 10, pp. 974-989, Oct. 1999.
[4] R.R. Rao, "Audio-Visual Interaction in Multimedia," Ph.D. thesis, Electrical Eng., Georgia Inst. of Technology, 1998.
[5] H.A. Rowley, S. Baluja, and T. Kanade, "Neural Network-Based Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23-38, Jan. 1998.
[6] L.S. Chen, "Joint Processing of Audio-Visual Information for the Recognition of Emotional Expressions in Human-Computer Interaction," Ph.D. thesis, Dept. of Electrical Engineering, University of Illinois at Urbana-Champaign, 2000.
[7] J. Lien, "Automatic Recognition of Facial Expressions Using Hidden Markov Models and Estimation of Expression Intensity," Ph.D. thesis, Carnegie Mellon University, 1998.
[8] T. Otsuka and J. Ohya, "A Study of Transformation of Facial Expressions Based on Expression Recognition from Temporal Image Sequences," Technical Report, Institute of Electronics, Information and Communication Engineers (IEICE), 1997.
[9] L.R. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[10] P. Ekman and W.V. Friesen, Facial Action Coding System: Investigator's Guide, Consulting Psychologists Press, Palo Alto, CA, 1978.