Automatic facial expression recognition system


Graduate Institute of Communication Engineering
College of Electrical Engineering and Computer Science
National Taiwan University

Tutorial

Automatic Facial Expression Recognition System

Jung-Zuo Liu

Advisor: Jian-Jiun Ding, Ph.D.

June, 2011


ABSTRACT


Human beings can detect facial expressions without any effort, while it is difficult for a computer to do the same job. The development of an automated system that can analyze facial expressions reliably therefore remains an open problem. There are several related sub-problems: detection of an image segment as a face, extraction of the facial expression information, and classification of the expression. A system that performs these operations accurately and in real time would be a big step toward achieving human-like interaction between man and machine. In this tutorial we illustrate some techniques that can improve the performance of automatic facial expression analysis.



INTRODUCTION

Facial expressions play an essential role in human communication. In face-to-face human communication, only 7% of the communicative message is conveyed by linguistic language and 38% by paralanguage, while 55% of it is transferred by facial expressions.

In this tutorial, we illustrate the techniques used in the facial expression analysis system we designed. The system consists of three parts, namely feature extraction, dimension reduction, and facial expression classification. When an image containing a specific facial expression enters our system, we first extract the important features that carry the information about the expression. Because these features can be very high-dimensional vectors, which makes it difficult to apply an excellent classifier directly, we then reduce the dimension of the features. After the dimension reduction, the lower-dimensional features are fed into the last part, the facial expression classifier, which produces the final output of our system.

In the following chapters, we introduce some key techniques in each part respectively. Chapter 1 illustrates the main features we use to represent the image, Chapter 2 explains some principal dimension reduction techniques that are widely used, and Chapter 3 introduces the concept of the support vector machine, which may be the most important and useful classifier.






Figure 1: The framework of our facial expression analysis system


CHAPTER 1: FEATURE EXTRACTION

1.1 Multistate Feature-Based Action Unit Recognition

In this section, we describe our multistate feature-based action unit (AU) recognition system, which explicitly analyzes appearance changes in localized facial features in a nearly frontal image sequence. Since each AU is associated with a specific set of facial muscles, we believe that accurate geometrical modeling and tracking of facial features will lead to better recognition results.

Figure 2 depicts the overall structure of the Automatic Face Analysis (AFA) system. Given an image sequence, the region of the face and the approximate locations of individual face features are detected automatically in the initial frame. Both permanent (e.g., brows, eyes, lips) and transient (lines and furrows) face feature changes are automatically detected and tracked in the image sequence. Informed by the Facial Action Coding System (FACS) AUs, we group the facial features into separate collections of feature parameters because the facial actions in the upper and lower face are relatively independent for AU recognition. In the upper face, 15 parameters describe shape, motion, eye state, motion of brow and cheek, and furrows. In the lower face, nine parameters describe shape, motion, lip state, and furrows. These parameters are geometrically normalized to compensate for image scale and in-plane head motion.


Figure 2: Feature-based automatic facial action analysis system (AFA)


1.2 Multistate Face Component Models

To detect and track changes of facial components in near-frontal images, we develop multistate facial component models. The models are illustrated in Table 2, which includes both permanent and transient components. A three-state lip model describes the lip state: open, closed, and tightly closed. A two-state model (open or closed) is used for each of the eyes. Each brow and cheek has a one-state model. Transient facial features, such as nasolabial furrows, have two states: present and absent.
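
To make the multistate idea concrete, here is a minimal Python sketch that encodes the component states listed above as enumerations; the class and attribute names (ComponentStates, nasolabial_furrow, and so on) are illustrative assumptions, not part of the actual AFA implementation.

    from dataclasses import dataclass
    from enum import Enum, auto

    class LipState(Enum):            # three-state lip model
        OPEN = auto()
        CLOSED = auto()
        TIGHTLY_CLOSED = auto()

    class EyeState(Enum):            # two-state eye model
        OPEN = auto()
        CLOSED = auto()

    class FurrowState(Enum):         # transient features are present or absent
        PRESENT = auto()
        ABSENT = auto()

    @dataclass
    class ComponentStates:
        """Hypothetical per-frame record of the tracked component states."""
        lips: LipState
        left_eye: EyeState
        right_eye: EyeState
        nasolabial_furrow: FurrowState

    # Example: a frame with closed lips, both eyes open, and no furrows
    frame = ComponentStates(LipState.CLOSED, EyeState.OPEN, EyeState.OPEN,
                            FurrowState.ABSENT)
    print(frame.lips.name)           # CLOSED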



Table 1: Upper Face Action Units and Some Combinations

Table 2: Multistate Facial Component Models of a Frontal Face


CHAPTER 2: DIMENSION REDUCTION

2.1 The PCA Space

The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.

Therefore, we can represent an image as a vector in a $t$-dimensional space and use principal component analysis to find a subspace whose basis vectors correspond to the maximum-variance directions in the original space. Let $W$ represent the linear transformation that maps the original $t$-dimensional space onto an $f$-dimensional feature subspace, where $f \ll t$. The new feature vectors $y_i$ are defined by $y_i = W^T x_i$, $i = 1, \dots, N$. The columns of $W$ are the eigenvectors $e_i$ obtained by solving the eigenstructure decomposition $\lambda_i e_i = Q e_i$, where $Q = XX^T$ is the covariance matrix and $\lambda_i$ is the eigenvalue associated with the eigenvector $e_i$. Before obtaining the eigenvectors of $Q$, we take the following two steps: (1) the vectors $x_i$ are normalized to unit length to make the system invariant to the intensity of the illumination source, and (2) the average of all images is subtracted from all normalized vectors to ensure that the eigenvector with the highest eigenvalue represents the dimension in the eigenspace in which the variance of the vectors is maximum in a correlation sense.
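
As a concrete illustration of the projection step described above, here is a minimal NumPy sketch that normalizes the image vectors, subtracts the mean, eigendecomposes the covariance matrix, and projects onto the top $f$ eigenvectors; the function name and the random data standing in for face images are illustrative choices.

    import numpy as np

    def pca_project(X, f):
        """X: array of shape (N, t), one flattened face image per row.
        Returns (Y, W, mean) where Y has shape (N, f)."""
        # (1) normalize each image to unit length (illumination invariance)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        # (2) subtract the mean image
        mean = X.mean(axis=0)
        Xc = X - mean
        # t x t covariance/scatter matrix; for very large t one would use the
        # smaller N x N Gram-matrix trick, omitted here for clarity
        Q = Xc.T @ Xc
        eigvals, eigvecs = np.linalg.eigh(Q)      # ascending eigenvalues
        W = eigvecs[:, ::-1][:, :f]               # top-f maximum-variance directions
        Y = Xc @ W                                # projected f-dimensional features
        return Y, W, mean

    # Example with random data standing in for face images
    X = np.random.rand(100, 256)                  # 100 images, 256 pixels each
    Y, W, mu = pca_project(X, f=10)
    print(Y.shape)                                # (100, 10)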




Figure 3: Gaussian samples

Figure 4: Gaussian samples with the eigenvectors of the sample covariance matrix

Figure 5: PC-projected samples

Figure 6: PC dimensionality reduction step


2.2 The LDA Space

Linear Discriminant Analysis (LDA) searches for those vectors in the underlying space that best discriminate among classes (rather than those that best describe the data). More formally, given a number of independent features relative to which the data is described, LDA creates a linear combination of these which yields the largest mean differences between the desired classes. Mathematically speaking, for all the samples of all classes, we define two measures: (1) the first is called the within-class scatter matrix, given by

$S_w = \sum_{j=1}^{c} \sum_{i=1}^{N_j} (x_i^j - \mu_j)(x_i^j - \mu_j)^T$    Eq. 1

where $x_i^j$ is the $i$-th sample of class $j$, $\mu_j$ is the mean of class $j$, $c$ is the number of classes, and $N_j$ is the number of samples in class $j$; and (2) the second is called the between-class scatter matrix,

$S_b = \sum_{j=1}^{c} (\mu_j - \mu)(\mu_j - \mu)^T$    Eq. 2

where $\mu$ represents the mean of all classes.

The goal is to maximize the between-class measure while minimizing the within-class measure. One way to do this is to maximize the ratio $\det(S_b)/\det(S_w)$. The advantage of using this ratio is that it has been proven that if $S_w$ is a nonsingular matrix, then this ratio is maximized when the column vectors of the projection matrix $W$ are the eigenvectors of $S_w^{-1} S_b$. It should be noted that: (1) there are at most $c - 1$ nonzero generalized eigenvectors, so an upper bound on $f$ is $c - 1$, and (2) we require at least $t + c$ samples to guarantee that $S_w$ does not become singular (which is almost impossible in any realistic application). To solve this, the use of an intermediate space has been proposed. In both cases, this intermediate space is chosen to be the PCA space. Thus, the original $t$-dimensional space is projected onto an intermediate $g$-dimensional space using PCA and then onto a final $f$-dimensional space using LDA.



In the following, let us work through LDA case by case.

Case 1: Project the data points from two different classes onto a one-dimensional space.

The goal is to find a vector $w$ and project the data points onto $w$, so that we obtain a new coordinate $y$:

$y = w^T x$    Eq. 3

The concept of LDA is to project the data points of the same class as close together as possible on the new coordinate, while keeping the data from different classes as far apart as possible. To describe this concept we need some quantitative variables; the first is the mean of the data in each class:

$m_i = \frac{1}{n_i} \sum_{x \in D_i} x$    Eq. 4

The mean of each class after projection is:

$\tilde{m}_i = \frac{1}{n_i} \sum_{y \in Y_i} y = \frac{1}{n_i} \sum_{x \in D_i} w^T x = w^T m_i$    Eq. 5

where $n_i$ is the number of data points in the $i$-th class, $D_i$ is the collection of the $i$-th class data, and $Y_i$ is the collection of the projected $i$-th class data.

Furthermore, we can see that the mean of each class after projection is simply the projection of the mean of the original data in the high-dimensional space.

Next we define the distance between the two class means after projection:

$\tilde{m}_1 - \tilde{m}_2 = w^T (m_1 - m_2)$    Eq. 6

We also define the scatter of each class after projection:

$\tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2$    Eq. 7

Based on the concept of LDA, the farther apart the projected data of the two classes are, the larger the difference between the projected means; and the more concentrated the projected data of the same class are, the smaller the scatter within each class. We can represent this concept mathematically as below:

$J(w) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$    Eq. 8

To express this ratio in terms of $w$, we define the scatter matrix of each class in the original space:

$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T$    Eq. 9

so that

$\tilde{s}_i^2 = \sum_{x \in D_i} (w^T x - w^T m_i)^2 = \sum_{x \in D_i} w^T (x - m_i)(x - m_i)^T w = w^T S_i w$    Eq. 10

Let $S_W = S_1 + S_2$. Then

$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T (S_1 + S_2) w = w^T S_W w$    Eq. 11



$(\tilde{m}_1 - \tilde{m}_2)^2 = (w^T m_1 - w^T m_2)^2 = w^T (m_1 - m_2)(m_1 - m_2)^T w = w^T S_B w$    Eq. 12
12

So we get that

$J(w) = \frac{w^T S_B w}{w^T S_W w}$

The $w$ that maximizes $J(w)$ satisfies the following equation:

$S_B w = \lambda S_W w$    Eq. 13

This is a generalized eigenvalue problem. When $S_W$ is invertible, the equation above becomes a simple eigenvalue problem:

$S_W^{-1} S_B w = \lambda w$    Eq. 14

But since $S_B w$ always points in the direction of $(m_1 - m_2)$, the solution is given exactly by the equation below, and we do not need to solve the eigenvalue problem:

$w = S_W^{-1} (m_1 - m_2)$    Eq. 15
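
A minimal NumPy sketch of this two-class solution (Eq. 15), assuming two arrays X1 and X2 that hold the samples of each class as rows; the names and the toy Gaussian data are illustrative.

    import numpy as np

    def fisher_direction(X1, X2):
        """Compute w = S_W^{-1} (m1 - m2) for two classes (Eq. 15).
        X1, X2: arrays of shape (n1, d) and (n2, d), one sample per row."""
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        S1 = (X1 - m1).T @ (X1 - m1)          # scatter matrix of class 1 (Eq. 9)
        S2 = (X2 - m2).T @ (X2 - m2)          # scatter matrix of class 2
        Sw = S1 + S2                          # within-class scatter (Eq. 11)
        w = np.linalg.solve(Sw, m1 - m2)      # S_W^{-1} (m1 - m2)
        return w / np.linalg.norm(w)          # normalize for convenience

    # Example: project two Gaussian blobs onto the discriminant direction
    rng = np.random.default_rng(0)
    X1 = rng.normal([0, 0], 1.0, size=(50, 2))
    X2 = rng.normal([3, 1], 1.0, size=(50, 2))
    w = fisher_direction(X1, X2)
    y1, y2 = X1 @ w, X2 @ w                   # projected 1-D coordinates (Eq. 3)
    print(y1.mean(), y2.mean())               # well-separated class means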

Case 2: Project data from multiple classes onto a higher-dimensional space.

Now we modify the formulation to handle more than two classes and projections onto more than one dimension. First, we convert the within-class scatter into its multi-class version (class number > 2):

$S_W = \sum_{i=1}^{c} S_i$    Eq. 16

Then we convert $S_B$ into the following equation:

$S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$    Eq. 17

where $m$ is the mean of all the data.

Note the difference from the $S_B$ of Case 1. Generally speaking, $S_B$ still describes the divergence between the classes.

When projecting data onto a higher-dimensional space, we no longer seek a single vector $w$ but a set of basis vectors. Consequently, we collect the vectors $w$ as the columns of a matrix $W$, each column being one basis vector. The numerator and denominator of the original criterion then become:

$\tilde{S}_B = W^T S_B W, \qquad \tilde{S}_W = W^T S_W W$    Eq. 18

and the $J(w)$ of Case 1 becomes:

$J(W) = \frac{|W^T S_B W|}{|W^T S_W W|}$    Eq. 19

Note that the $W$ in $J(W)$ is a matrix that represents a set of basis vectors.

As shown in Figure 7, the basis derived by PCA generates the minimum reconstruction error (in the Euclidean-distance sense). On the other hand, the basis derived by LDA is very different from that of PCA. From the figure we can see that the data points projected onto the LDA basis are separated into two classes, while the PCA projection cannot separate them.



Figure 7: The difference between PCA and LDA
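
For the multi-class case, the columns of $W$ come from the generalized eigenvalue problem of Eq. 13 applied to the scatter matrices of Eqs. 16 and 17. Below is a minimal sketch using SciPy's symmetric generalized eigensolver; the function name and array shapes are illustrative assumptions.

    import numpy as np
    from scipy.linalg import eigh

    def lda_basis(X, labels, f):
        """X: (N, d) samples, labels: (N,) integer class labels.
        Returns W of shape (d, f), with f <= c - 1 (Eqs. 13, 16, 17)."""
        m = X.mean(axis=0)
        d = X.shape[1]
        Sw = np.zeros((d, d))
        Sb = np.zeros((d, d))
        for c in np.unique(labels):
            Xc = X[labels == c]
            mc = Xc.mean(axis=0)
            Sw += (Xc - mc).T @ (Xc - mc)             # within-class scatter, Eq. 16
            Sb += len(Xc) * np.outer(mc - m, mc - m)  # between-class scatter, Eq. 17
        # Generalized eigenproblem S_B w = lambda S_W w; in practice S_W may be
        # singular, in which case the data is first projected with PCA as above.
        vals, vecs = eigh(Sb, Sw)
        return vecs[:, ::-1][:, :f]                   # top-f discriminant directions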



CHAPTER 3: FACIAL EXPRESSION CLASSIFIER

3.1 Introduction to Machine Learning

Machine learning is programming computers to optimize a performance criterion using example data or past experience. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. The model may be predictive, to make predictions in the future, or descriptive, to gain knowledge from data, or both.

Generally speaking, there are two major types of learning with respect to the input data: supervised learning and unsupervised learning.

In supervised learning, the aim is to learn a mapping from the input to an output whose correct values are provided by a supervisor. In unsupervised learning, there is no such supervisor and we only have input data. The aim of unsupervised learning is to find regularities in the input: there is a structure to the input space such that certain patterns occur more often than others, and we want to see what generally happens and what does not. Before we describe the machine learning problem mathematically, we introduce some notation.

We use $x^{(i)}$ to denote the "input" variables, also called input features, and $y^{(i)}$ to denote the "output" or target variable that we are trying to predict. A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset that we'll be using to learn, a list of $m$ training examples $\{(x^{(i)}, y^{(i)});\ i = 1, \dots, m\}$, is called a training set. We will also use $\mathcal{X}$ to denote the space of input values and $\mathcal{Y}$ the space of output values. To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function $h: \mathcal{X} \to \mathcal{Y}$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$. This function $h(x)$ is called a hypothesis. The following picture illustrates the supervised learning process:


Figure 8: The process of supervised learning


3.2 Linear Regression


When the target variable that we're trying to predict is continuous, we call the learning problem a regression problem. When $y$ can take on only a small number of discrete values, we call it a classification problem.


Figure 9: An example of linear regression

To perform supervised learning, we decide to approximate $y$ as a linear function of $x$:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$    Eq. 20

where the $\theta_i$'s are the parameters parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is called the intercept term), so that

$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$    Eq. 21

Now, given a training set, how do we learn the parameters $\theta$? One reasonable method is to make $h(x)$ close to $y$, at least for the training examples we have. We define the cost function:

$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$    Eq. 22

We want to choose $\theta$ so as to minimize $J(\theta)$. To do so, let us use a search algorithm that starts with some "initial guess" for $\theta$ and repeatedly changes $\theta$ to make $J(\theta)$ smaller, until hopefully we converge to a value of $\theta$ that minimizes $J(\theta)$. Specifically, let us consider the gradient descent algorithm, which starts with some initial $\theta$ and repeatedly performs the update:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$    Eq. 23

Here, $\alpha$ is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$.

In order to implement this algorithm, we have to work out what the partial derivative term on the right-hand side is. Let's first work it out for the case where we have only one training example $(x, y)$, so that we can neglect the sum in the definition of $J$. We have:

$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2}\big(h_\theta(x) - y\big)^2$
$\qquad = 2 \cdot \frac{1}{2}\big(h_\theta(x) - y\big) \cdot \frac{\partial}{\partial \theta_j}\big(h_\theta(x) - y\big)$
$\qquad = \big(h_\theta(x) - y\big) \cdot \frac{\partial}{\partial \theta_j}\Big(\sum_{i=0}^{n} \theta_i x_i - y\Big)$
$\qquad = \big(h_\theta(x) - y\big)\, x_j$    Eq. 24

For a single training example, this gives the update rule:

$\theta_j := \theta_j + \alpha \big(y^{(i)} - h_\theta(x^{(i)})\big) x_j^{(i)}$    Eq. 25

This rule is called the LMS (least mean squares) update rule. It has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term $\big(y^{(i)} - h_\theta(x^{(i)})\big)$.


We derived the LMS rule for when there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm:

Repeat until convergence {

$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \big(y^{(i)} - h_\theta(x^{(i)})\big) x_j^{(i)}$    (for every $j$)    Eq. 26

}

We can easily verify that the quantity in the summation in the update rule above is just $\partial J(\theta) / \partial \theta_j$ (for the original definition of $J$). So, this is simply gradient descent on the original cost function $J$. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges (assuming the learning rate $\alpha$ is not too large) to the global minimum. Indeed, $J$ is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.


Figure 10: An example of gradient descent

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent. The x's in the figure mark the successive values of $\theta$ that gradient descent went through.
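
The batch update in Eq. 26 is short enough to sketch directly; below is a minimal NumPy implementation run on synthetic data. The learning rate, iteration count, 1/m scaling, and variable names are illustrative choices, not values from this tutorial.

    import numpy as np

    def batch_gradient_descent(X, y, alpha=0.5, iters=1000):
        """X: (m, n) inputs WITHOUT the intercept column; y: (m,) targets.
        Returns theta of shape (n + 1,), with theta[0] as the intercept term."""
        m = len(y)
        Xb = np.hstack([np.ones((m, 1)), X])            # x_0 = 1 convention
        theta = np.zeros(Xb.shape[1])
        for _ in range(iters):
            error = y - Xb @ theta                      # y^(i) - h_theta(x^(i))
            theta = theta + alpha * (Xb.T @ error) / m  # Eq. 26 (scaled by 1/m)
        return theta

    # Example: recover a known line y = 2 + 3x from noisy samples
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=(100, 1))
    y = 2 + 3 * x[:, 0] + 0.05 * rng.normal(size=100)
    print(batch_gradient_descent(x, y))                 # approximately [2.0, 3.0]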

Consider the problem of predicting $y$ from $x \in \mathbb{R}$. The leftmost figure below shows the result of fitting a line $y = \theta_0 + \theta_1 x$ to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good. Instead, if we had added an extra feature $x^2$ and fit $y = \theta_0 + \theta_1 x + \theta_2 x^2$, then we would obtain a slightly better fit to the data (see the middle figure). Naively, it might seem that the more features we add, the better. However, there is also a danger in adding too many features. The rightmost figure is the result of fitting a 5th-order polynomial $y = \sum_{j=0}^{5} \theta_j x^j$. We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor. Without formally defining what these terms mean, we'll say the figure on the left shows an instance of underfitting, in which the data clearly shows structure not captured by the model, and the figure on the right is an example of overfitting.


Figure 11: Different situations when fitting the training data
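
The under- and overfitting behavior described above is easy to reproduce. The sketch below fits polynomials of degree 1, 2, and 5 to a small noisy quadratic dataset with NumPy's least-squares polynomial fit; the dataset and the chosen degrees are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 7)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.normal(size=x.size)   # noisy quadratic

    for degree in (1, 2, 5):
        coeffs = np.polyfit(x, y, degree)                # least-squares polynomial fit
        train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
        print(degree, round(train_err, 6))
    # Degree 1 underfits (largest training error); degree 5 fits the seven training
    # points almost exactly but is a poor predictor between them (overfitting).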


Figure 12: There are many ways to separate the data from different classes


3.3 Introduction to Support Vector Machine



We'll start our introduction to SVMs by talking about margins. This section gives the intuition about margins and about the "confidence" of our predictions.

Consider logistic regression, where the probability $p(y = 1 \mid x; \theta)$ is modeled by $h_\theta(x) = g(\theta^T x)$. We would then predict "1" on an input $x$ if and only if $h_\theta(x) \ge 0.5$, or equivalently, if and only if $\theta^T x \ge 0$. Consider a positive training example ($y = 1$). The larger $\theta^T x$ is, the larger also is $h_\theta(x) = p(y = 1 \mid x; \theta)$, and thus also the higher our degree of "confidence" that the label is 1. Thus, informally we can think of our prediction as being a very confident one that $y = 1$ if $\theta^T x \gg 0$. Similarly, we think of logistic regression as making a very confident prediction of $y = 0$ if $\theta^T x \ll 0$. Given a training set, again informally it seems that we'd have found a good fit to the training data if we can find $\theta$ so that $\theta^T x^{(i)} \gg 0$ whenever $y^{(i)} = 1$, and $\theta^T x^{(i)} \ll 0$ whenever $y^{(i)} = 0$, since this would reflect a very confident (and correct) set of classifications for all the training examples. This seems to be a nice goal to aim for, and we'll soon formalize this idea using the notion of functional margins.

For a different type of intuition, consider the following figure, in which x's represent positive training examples and o's denote negative training examples. A decision boundary (this is the line given by the equation $\theta^T x = 0$, also called the separating hyperplane) is shown, and three points have been labeled A, B, and C.


Figure 13: The confidence of predictions at different locations relative to the decision boundary


Note that the point A is very far from the decision boundary. If we are asked to make a prediction for the value of $y$ at A, it seems we should be quite confident that $y = 1$ there. Conversely, the point C is very close to the decision boundary, and while it is on the side of the decision boundary on which we would predict $y = 1$, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be $y = 0$. Hence, we're much more confident about our prediction at A than at C. The point B lies in between these two cases, and more broadly, we see that if a point is far from the separating hyperplane, then we may be significantly more confident in our predictions. Again, informally we think it would be nice if, given a training set, we manage to find a decision boundary that allows us to make all correct and confident (meaning far from the decision boundary) predictions on the training examples.

To make our discussion of SVMs easier, we first need to introduce a new notation for talking about classification. We will be considering a linear classifier for a binary classification problem with labels $y$ and features $x$. From now on, we'll use $y \in \{-1, 1\}$ (instead of $\{0, 1\}$) to denote the class labels. Also, rather than parameterizing our linear classifier with the vector $\theta$, we will use parameters $w, b$ and write our classifier as

$h_{w,b}(x) = g(w^T x + b)$    Eq. 27

Here, $g(z) = 1$ if $z \ge 0$, and $g(z) = -1$ otherwise. This "$w, b$" notation allows us to explicitly treat the intercept term $b$ separately from the other parameters. (We also drop the convention we had previously of letting $x_0 = 1$ be an extra coordinate in the input feature vector.) Thus, $b$ takes the role of what was previously $\theta_0$, and $w$ takes the role of $[\theta_1 \dots \theta_n]^T$. Note also that, from our definition of $g$ above, our classifier will directly predict either 1 or -1, without first going through the intermediate step of estimating the probability of $y$ being 1.
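
As a tiny sketch of Eq. 27, the sign-valued classifier can be written directly; the example weights and points below are illustrative.

    import numpy as np

    def g(z):
        """g(z) = 1 if z >= 0, and -1 otherwise."""
        return np.where(z >= 0, 1, -1)

    def predict(w, b, X):
        """h_{w,b}(x) = g(w^T x + b), applied row-wise to X of shape (m, n)."""
        return g(X @ w + b)

    # Example: a 2-D classifier that labels points above the line x1 + x2 = 1 as +1
    w, b = np.array([1.0, 1.0]), -1.0
    print(predict(w, b, np.array([[2.0, 2.0], [0.0, 0.0]])))   # [ 1 -1]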

Let's formalize the notions of the functional and geometric margins. Given a training example $(x^{(i)}, y^{(i)})$, we define the functional margin of $(w, b)$ with respect to the training example as

$\hat{\gamma}^{(i)} = y^{(i)} \big(w^T x^{(i)} + b\big)$    Eq. 28

Note that if $y^{(i)} = 1$, then for the functional margin to be large we need $w^T x^{(i)} + b$ to be a large positive number. Conversely, if $y^{(i)} = -1$, then for the functional margin to be large we need $w^T x^{(i)} + b$ to be a large negative number. Moreover, if $y^{(i)} \big(w^T x^{(i)} + b\big) > 0$, then our prediction on this example is correct. Hence, a large functional margin represents a confident and correct prediction.

For a linear classifier with the choice of $g$ given above (taking values in $\{-1, 1\}$), there is one property of the functional margin that makes it not a very good measure of confidence, however. Given our choice of $g$, we note that if we replace $w$ with $2w$ and $b$ with $2b$, then since $g(w^T x + b) = g(2w^T x + 2b)$, this would not change $h_{w,b}(x)$ at all: $g$, and hence also $h_{w,b}(x)$, depends only on the sign, but not on the magnitude, of $w^T x + b$. However, replacing $(w, b)$ with $(2w, 2b)$ also results in multiplying our functional margin by a factor of 2. Thus, it seems that by exploiting our freedom to scale $w$ and $b$, we can make the functional margin arbitrarily large without really changing anything meaningful. Intuitively, it might therefore make sense to impose some sort of normalization condition such as $\|w\|_2 = 1$. That is, we might replace $(w, b)$ with $(w / \|w\|_2,\ b / \|w\|_2)$, and instead consider the functional margin of $(w / \|w\|_2,\ b / \|w\|_2)$.

Given a training set $S = \{(x^{(i)}, y^{(i)});\ i = 1, \dots, m\}$, we also define the functional margin of $(w, b)$ with respect to $S$ as the smallest of the functional margins of the individual training examples. Denoted by $\hat{\gamma}$, this can be written:

$\hat{\gamma} = \min_{i = 1, \dots, m} \hat{\gamma}^{(i)}$    Eq. 29

Next, let's talk about geometric margins. Consider the picture below:

Figure 14: Illustration of the margins

The decision boundary corresponding to $(w, b)$ is shown, along with the vector $w$. Note that $w$ is orthogonal to the separating hyperplane. Consider the point A, which represents the input $x^{(i)}$ of some training example with label $y^{(i)} = 1$. Its distance to the decision boundary, $\gamma^{(i)}$, is given by the line segment AB.

How can we find the value of $\gamma^{(i)}$? Note that $w / \|w\|$ is a unit-length vector pointing in the same direction as $w$. Since A represents $x^{(i)}$, we therefore find that the point B is given by $x^{(i)} - \gamma^{(i)} \cdot w / \|w\|$. But this point lies on the decision boundary, and all points $x$ on the decision boundary satisfy the equation $w^T x + b = 0$. Hence,

$w^T \Big(x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|}\Big) + b = 0$    Eq. 30

Solving for $\gamma^{(i)}$ yields

$\gamma^{(i)} = \frac{w^T x^{(i)} + b}{\|w\|} = \Big(\frac{w}{\|w\|}\Big)^T x^{(i)} + \frac{b}{\|w\|}$


This was worked out for the case of a positive training example at A in the figure, where being on the "positive" side of the decision boundary is good. More generally, we define the geometric margin of $(w, b)$ with respect to a training example $(x^{(i)}, y^{(i)})$ to be

$\gamma^{(i)} = y^{(i)} \bigg(\Big(\frac{w}{\|w\|}\Big)^T x^{(i)} + \frac{b}{\|w\|}\bigg)$    Eq. 31

Note that if $\|w\| = 1$, then the functional margin equals the geometric margin; this gives us a way of relating these two different notions of margin. Also, the geometric margin is invariant to rescaling of the parameters. That is, if we replace $w$ with $2w$ and $b$ with $2b$, the geometric margin does not change. This will in fact come in handy later. Specifically, because of this invariance to the scaling of the parameters, when trying to fit $w$ and $b$ to training data, we can impose an arbitrary scaling constraint on $w$ without changing anything important. For instance, we can demand that $\|w\| = 1$ or $|w_1 + b| + |w_2| = 2$, and any such constraint can be satisfied simply by rescaling $w$ and $b$.
Finall
y, given a
training set






(,);1,....,
i i
S x y i m
 
,
we also define

the geometric margin of
(,)
w b
with respect
to
S
to be the smallest of the

geometric margins on the individual training examples:



1,...,
ˆ
ˆ
min
i
i m
 



Eq.
32
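
A short NumPy sketch of Eqs. 28-32, computing the functional and geometric margins of a given $(w, b)$ over a toy training set; all numbers and names are illustrative. The second call shows the scaling behavior discussed above: doubling $(w, b)$ doubles the functional margin but leaves the geometric margin unchanged.

    import numpy as np

    def margins(w, b, X, y):
        """X: (m, n) inputs, y: (m,) labels in {-1, +1}.
        Returns (functional margin, geometric margin) of (w, b) w.r.t. the set."""
        functional = y * (X @ w + b)                  # Eq. 28 for every example
        geometric = functional / np.linalg.norm(w)    # Eq. 31
        return functional.min(), geometric.min()      # Eqs. 29 and 32

    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0]])
    y = np.array([1, 1, -1])
    w, b = np.array([1.0, 1.0]), -1.0
    print(margins(w, b, X, y))
    print(margins(2 * w, 2 * b, X, y))   # functional doubles, geometric does not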


Given a training set, it seems from our previous discussion that a natural desideratum is to try to find a decision boundary that maximizes the (geometric) margin, since this would reflect a very confident set of predictions on the training set and a good "fit" to the training data. Specifically, this will result in a classifier that separates the positive and the negative training examples with a "gap". For now, we will assume that we are given a training set that is linearly separable.

That is, it is possible to separate the positive and negative examples using some separating hyperplane. How do we find the one that achieves the maximum geometric margin? We can pose the following optimization problem:

$\max_{\gamma, w, b} \ \gamma$
$\text{s.t.}\ \ y^{(i)}\big(w^T x^{(i)} + b\big) \ge \gamma, \quad i = 1, \dots, m$
$\qquad \|w\| = 1$    Eq. 33

We want to maximize $\gamma$, subject to each training example having functional margin at least $\gamma$. The $\|w\| = 1$ constraint moreover ensures that the functional margin equals the geometric margin, so we are also guaranteed that all the geometric margins are at least $\gamma$. Thus, solving this problem will result in the $(w, b)$ with the largest possible geometric margin with respect to the training set.

If we could solve the optimization problem above, we'd be done. But the "$\|w\| = 1$" constraint is a nasty (non-convex) one, and this problem certainly isn't in any format that we can plug into standard optimization software to solve. So, let's try transforming the problem into a nicer one. Consider:

$\max_{\hat{\gamma}, w, b} \ \frac{\hat{\gamma}}{\|w\|}$
$\text{s.t.}\ \ y^{(i)}\big(w^T x^{(i)} + b\big) \ge \hat{\gamma}, \quad i = 1, \dots, m$    Eq. 34

Here, we're going to maximize $\hat{\gamma} / \|w\|$, subject to the functional margins all being at least $\hat{\gamma}$. Since the geometric and functional margins are related by $\gamma = \hat{\gamma} / \|w\|$, this will give us the answer we want. Moreover, we've gotten rid of the constraint $\|w\| = 1$ that we didn't like. The downside is that we now have a nasty objective function $\hat{\gamma} / \|w\|$, and we still don't have any off-the-shelf software that can solve this form of an optimization problem.

Let's keep going. Recall our earlier discussion that we can add an arbitrary scaling constraint on $w$ and $b$ without changing anything. This is the key idea we'll use now. We will introduce the scaling constraint that the functional margin of $(w, b)$ with respect to the training set must be 1:

$\hat{\gamma} = 1$



Since multiplying $w$ and $b$ by some constant results in the functional margin being multiplied by that same constant, this is indeed a scaling constraint, and it can be satisfied by rescaling $w$ and $b$. Plugging this into the problem above, and noting that maximizing $\hat{\gamma} / \|w\| = 1 / \|w\|$ is the same thing as minimizing $\|w\|^2$, we now have the following optimization problem:

$\min_{w, b} \ \frac{1}{2} \|w\|^2$
$\text{s.t.}\ \ y^{(i)}\big(w^T x^{(i)} + b\big) \ge 1, \quad i = 1, \dots, m$    Eq. 35

We've now transformed the problem into a form that can be efficiently solved. The above is an optimization problem with a convex quadratic objective and only linear constraints. Its solution gives us the optimal margin classifier.
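
Since Eq. 35 is a convex quadratic program, it can be handed to a generic convex solver. The sketch below uses the CVXPY library on a toy linearly separable set; the solver choice and the data are illustrative assumptions, not part of this tutorial.

    import numpy as np
    import cvxpy as cp

    # Toy linearly separable data with labels in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
                  [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
    y = np.array([1, 1, 1, -1, -1, -1])

    w = cp.Variable(2)
    b = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2)||w||^2
    constraints = [cp.multiply(y, X @ w + b) >= 1]          # y^(i)(w^T x^(i) + b) >= 1
    cp.Problem(objective, constraints).solve()

    print(w.value, b.value)
    # The geometric margin of the resulting classifier is 1 / ||w||
    print(1.0 / np.linalg.norm(w.value))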

CONCLUSION


We have illustrated our facial expression analysis system and explained each of its parts, introducing the key techniques used in each one. We now have an idea of what an AU is and understand the advantages of the AFA system, we know the difference between PCA and LDA, and we have the basic concept of the support vector machine. In the future, we should try more types of features to find the most appropriate one to represent the input image, and we should investigate the association between different classifiers and different features so that we can find the best combination for facial expression analysis.
