Facial Component Extraction and Face Recognition
with Support Vector Machines
Dihua Xi, Igor T. Podolak, and Seong-Whan Lee
Center for Artificial Vision Research, Korea University
Anam-dong, Seongbuk-ku, Seoul 136-701, Korea
{dhxi, uipodola, swlee}@image.korea.ac.kr
Abstract
A method for face recognition is proposed which uses a two-step approach: first, a number of facial components are found, which are then glued together, and the resulting face vector is recognized as representing one of the possible persons. During the extraction step, a wavelet statistics subsystem provides the possible locations of the eyes and mouth, which are used by the Support Vector Machine (SVM) subsystem to extract facial components. The use of the wavelet statistics subsystem speeds up the recognition process markedly. Both the feature detection SVMs and the wavelet statistics are trained on a small number of actual images with features marked. Afterwards, a large number of face vectors are constructed, which are then classified with another set of SVMs.
1. Introduction
Face recognition has emerged as an important component of pattern recognition with a vast number of possible applications. A number of different approaches are used to tackle this problem, including eigenfaces, PCA, neural networks, and support vector machines [6, 1, 7]. Recently, the SVM method has also been applied to face authentication [9].
Two basic approaches to face recognition with SVMs are possible: either the whole image is recognized (as in [1, 3]), or some selected features are extracted first, and the recognition follows. We have decided to pursue this component-based approach [2], since the whole-image method suffers greatly from any shifts of the image.
This research was supported by Creative Research Initiatives of the Ministry of Science and Technology, Korea. Current address of Dr. I. T. Podolak: Institute of Computer Science, Jagiellonian University, Krakow, Poland.
Basically, a set of selected features (eyes, mouth, nose, etc.) is extracted first; then, by concatenating the features, a face-vector is built, which is eventually recognized. In our system all of the features are extracted using SVMs specially trained for each task. Feature extraction using only SVMs would require a sliding window moving across the whole image and would be very slow, so we decided to use a number of other methods. First, a wavelet detection system is employed to give hints as to where both eyes and the mouth are. The area to check with SVMs is therefore much reduced. Additionally, a face geometry is defined in fuzzy logic terms, which makes it possible to predict the most probable locations of other features (e.g. nose, nose bridge), again reducing the search area. Once a set of candidates is found, the same face geometry is used to construct face-vectors and select those with the best fuzzy membership values.
Another set of SVMs is trained to perform the actual recognition of persons, with face-vectors as examples. During recognition the same track is followed: feature extraction, then face-vector composition and recognition.
2. Component Location with Wavelet
Statistics System
A fast and stable algorithm for finding proper candidates for the eyes and mouth in a face image is very important for an SVM-based face recognition system. In this section, a novel approach based on wavelets and statistics is described. A multiresolution wavelet is used to decompose a face image into sub-images. A scale-independent facial model based on modified Bookstein coordinates is constructed.
Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition (FGR02)
0769516025/02 $17.00 © 2002 IEEE
Experimental results show this approach to be very fast and able to find correct candidates.
2.1. Facial coordinate system
A facial coordinate system based on Bookstein's is used to describe the geometric shape of a face. It contains two categories of coordinates (shown at the bottom right of Figure 1), which indicate the location and the shape of each component respectively. The main coordinate system is used to describe the centers of the brows, eyes and mouth. Its origin is set to the center of the left eye, and the distance between the left and right eyes is set to unity. For a face that is almost frontal and upright, the distance between the two eyes can be approximated by their horizontal distance. The coordinate of the right eye is then $(1, y^B_{re})$.
[Figure 1: image panels labeled LH 1 to LH 4 and HL 1 to HL 4]
Figure 1. Main and componential coordinates used to define facial geometry shape.
Suppose the screen coordinates of the centers of the left and right eyes are $(x_{le}, y_{le})$ and $(x_{re}, y_{re})$; then their corresponding facial coordinates are $(0, 0)$ and $\left(1, \frac{y_{re} - y_{le}}{x_{re} - x_{le}}\right)$ respectively. Therefore, the centers of all five components can be described by a 7-dimensional vector $(y^B_{re}, x^B_{lb}, y^B_{lb}, x^B_{rb}, y^B_{rb}, x^B_{m}, y^B_{m})$.
The shape of a facial component is defined by several feature points. For each component, we define a componential coordinate system whose origin is set to the componential center and whose unit equals the unit of the main coordinate system. The componential coordinate system of the mouth is shown in Figure 1 by the axes $x^m_s y^m_s$.
With this coordinate system, the facial shape coordinates remain unchanged even when the size of the image or the face rectangle changes. The use of componential coordinates for each component makes estimation and comparison of components from different images possible. The exact shape of the whole face can be estimated from the x distance between the left and right eyes.
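The mapping from screen coordinates to the main facial coordinate system can be sketched in a few lines. This is only an illustration of the definition above: the function name `to_facial_coords` and the sample eye positions are assumptions, not part of the original system.

```python
def to_facial_coords(point, left_eye, right_eye):
    """Map a screen point into the main facial coordinate system.

    The origin is the left-eye center and the unit length is the
    horizontal distance between the two eye centers.
    """
    x_le, y_le = left_eye
    x_re, y_re = right_eye
    unit = x_re - x_le  # horizontal eye distance, used as unity
    if unit == 0:
        raise ValueError("eye centers must have distinct x coordinates")
    x, y = point
    return ((x - x_le) / unit, (y - y_le) / unit)


# Hypothetical eye centers in screen coordinates.
left_eye, right_eye = (200.0, 240.0), (300.0, 245.0)
print(to_facial_coords(left_eye, left_eye, right_eye))   # (0.0, 0.0)
print(to_facial_coords(right_eye, left_eye, right_eye))  # (1.0, 0.05)
```

Because every coordinate is divided by the eye distance, scaling the whole image leaves the facial coordinates unchanged, which is the scale independence claimed for the model.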
2.2. Image decomposition
Using a multiresolution wavelet [5], an image can be decomposed into a sequence of sub-images which contain different frequency information corresponding to different directions at different scales [4]. Suppose $f(x, y)$ is an image and let $f^0_{LL}(x, y) = f(x, y)$; then
$$f^{n}_{LL}(x, y) = f^{n+1}_{LL}(x, y) \oplus f^{n+1}_{LH}(x, y) \oplus f^{n+1}_{HL}(x, y) \oplus f^{n+1}_{HH}(x, y). \qquad (1)$$
The LH and HL sub-images, which carry the horizontal and vertical information of an image, are used in our research. Figure 1 gives an example of the decomposition of an image. The route of the image decomposition is shown in Figure 2. Notice that the width and height of a sub-image at level $n$ are half of those at level $n - 1$.
[Figure 2 diagram: $f(x, y) \to f^1_{LL}(x, y) \to f^2_{LL}(x, y) \to \cdots \to f^N_{LL}(x, y)$ across level 0, level 1, level 2, ..., level N; each level $n \ge 1$ also yields $f^n_{LH}(x, y)$ and $f^n_{HL}(x, y)$.]
Figure 2. Route of wavelet decomposition.
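The route in Figure 2 can be illustrated with a minimal sketch that repeatedly splits the LL band. The paper uses a Daubechies wavelet; the Haar wavelet is swapped in here only because it fits in a few lines. The function names, the LH/HL naming convention, and the list-of-rows image representation are assumptions made for the illustration.

```python
def haar_level(f):
    """One level of a 2-D Haar decomposition of a list-of-rows image.

    Returns (LL, LH, HL, HH), each half the width and height of f.
    """
    rows, cols = len(f) // 2, len(f[0]) // 2
    ll = [[0.0] * cols for _ in range(rows)]
    lh = [[0.0] * cols for _ in range(rows)]
    hl = [[0.0] * cols for _ in range(rows)]
    hh = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            a, b = f[2 * r][2 * c], f[2 * r][2 * c + 1]
            d, e = f[2 * r + 1][2 * c], f[2 * r + 1][2 * c + 1]
            ll[r][c] = (a + b + d + e) / 2.0  # local average
            lh[r][c] = (a + b - d - e) / 2.0  # difference between rows
            hl[r][c] = (a - b + d - e) / 2.0  # difference between columns
            hh[r][c] = (a - b - d + e) / 2.0  # diagonal difference
    return ll, lh, hl, hh


def decompose(f, min_size=32):
    """Split the LL band repeatedly while the next sub-image stays >= min_size."""
    levels = []
    ll = f
    while min(len(ll), len(ll[0])) // 2 >= min_size:
        ll, lh, hl, _hh = haar_level(ll)
        levels.append({"LL": ll, "LH": lh, "HL": hl})
    return levels


image = [[float((r * c) % 7) for c in range(128)] for r in range(128)]
levels = decompose(image)
print(len(levels), len(levels[-1]["LL"]))  # 2 32
```

The `min_size=32` default mirrors the constraint stated in section 2.5 that the smallest sub-image should be no smaller than 32 pixels on a side.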
2.3. Feature vector construction
In the previous sections, we have shown that the feature of a point can be described by its facial coordinates and its responses in the LH and HL sub-images at all levels. Suppose the decomposition goes up to level $N$; then the vector
$$(x^B, y^B, f^1_{LH}(x, y), f^1_{HL}(x, y), \ldots, f^N_{LH}(x, y), f^N_{HL}(x, y))$$
describes the feature of a point $(x, y)$ and is called its feature vector. To reduce the feature vector dimension, let $f^i(x, y) = f^i_{LH}(x, y) + f^i_{HL}(x, y)$, so that the vector can be represented by $(x^B, y^B, f^1(x, y), \ldots, f^N(x, y))$. To describe the whole facial model, it is still required that the center of each component be described only by its coordinate $(x^c_i, y^c_i)$, $(i = 1, \ldots, N)$.
In summary, a face model can be defined by $x_i = (x^c_i, y^c_i)$ and $v_i = (v^1_i, \ldots, v^{N_i}_i)$, where $x_i$ is the center of component $i$, $v^j_i$ is the feature vector of the $j$-th feature point of component $i$, and $N_i$ is the number of feature points of component $i$. We call the production of $\{x_i\}$ and $\{v^j_i\}$, $(i = 1, \ldots, 5;\ j = 1, \ldots, N_i)$, the normalization of a facial image.
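As a rough illustration, the reduced feature vector $(x^B, y^B, f^1(x, y), \ldots, f^N(x, y))$ of one point can be assembled as below. The text does not say how a full-resolution pixel is mapped into the level-$i$ sub-image; halving the pixel indices once per level is an assumption of this sketch, as are the function name and the dictionary layout of the bands.

```python
def feature_vector(x, y, facial_xy, levels):
    """Build (x^B, y^B, f^1(x, y), ..., f^N(x, y)) for one screen point.

    levels[i - 1] holds the LH and HL sub-images of level i as lists of
    rows. The pixel (x, y) is mapped to level i by halving its indices
    i times (an assumed convention).
    """
    vec = [facial_xy[0], facial_xy[1]]
    for i, bands in enumerate(levels, start=1):
        r, c = y >> i, x >> i
        vec.append(bands["LH"][r][c] + bands["HL"][r][c])  # f^i = f^i_LH + f^i_HL
    return vec


# Tiny hand-made bands for a 4 x 4 image decomposed to N = 2 levels.
levels = [
    {"LH": [[1, 2], [3, 4]], "HL": [[10, 20], [30, 40]]},
    {"LH": [[5]], "HL": [[7]]},
]
print(feature_vector(2, 1, (0.5, 0.25), levels))  # [0.5, 0.25, 22, 12]
```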
2.4. Statistical face model and training
For training the model, our database contains 400 facial images (512 × 512) with feature points (in screen coordinates) marked for all five facial components: both brows, both eyes, and the mouth. First, all $N$ facial images are initialized to estimate their corresponding facial shape models. The mean face is produced by calculating the mean of each componential center and of all feature points for each component. Let $\{\bar{x}_i\}$ and $\{\bar{v}^j_i\}$ be the means over all faces. Then, we can estimate the variable rectangle for the center of each component. Actually, only the y coordinates of the right eye, mouth, and both brows need to be processed. The statistical face model consists of the mean face and the rectangle of each component, which constrains the area in which its center may move. This model is used to search for the locations of the eyes and mouth.
2.5. Searching for eyes and mouth candidates
In this section, we introduce a fast algorithm to search for candidates for the eyes and mouth in any input facial image.
Using the algorithm of section 2.2, an input image is first decomposed into a sequence of sub-images at all levels (we constrain both the width and height of the smallest sub-image to be no less than 32, otherwise it would be too small to be recognized). The sum $f_{LH} + f_{HL}$ of the sub-images at each level is used for searching with the model.
To match a facial component in a sub-image, the modified cross correlation (MCC) of two point sets, $(x_i, y_i)$ and $(x'_i, y'_i)$ $(i = 1, \ldots, N)$, is used. Suppose $a_i = f(x_i, y_i)$ and $b_i = f'(x'_i, y'_i)$; the MCC is calculated by
$$\sum_{i=1}^{N} (a'_i - \bar{a})(b'_i - \bar{b}), \qquad (2)$$
where
$$a'_i = a_i \Big/ \sum_{k=1}^{N} a_k, \quad b'_i = b_i \Big/ \sum_{k=1}^{N} b_k, \quad \bar{a} = \frac{1}{N} \sum_{i=1}^{N} a'_i, \quad \bar{b} = \frac{1}{N} \sum_{i=1}^{N} b'_i.$$
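A direct transcription of the MCC, assuming the reading of equation (2) in which each response is first divided by the sum of its set and then centered on the mean of the normalized values. The function name `mcc` and the zero-sum guard are additions for the sketch.

```python
def mcc(a, b):
    """Modified cross correlation of two equally long response lists."""
    if len(a) != len(b) or not a:
        raise ValueError("point sets must be non-empty and of equal size")
    sum_a, sum_b = sum(a), sum(b)
    if sum_a == 0 or sum_b == 0:
        raise ValueError("responses must not sum to zero")
    n = len(a)
    a_norm = [v / sum_a for v in a]  # a'_i = a_i / sum(a)
    b_norm = [v / sum_b for v in b]  # b'_i = b_i / sum(b)
    mean_a = sum(a_norm) / n
    mean_b = sum(b_norm) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a_norm, b_norm))


model = [1.0, 2.0, 3.0]
print(mcc(model, [2.0, 4.0, 6.0]) > 0)  # True: same pattern, different contrast
print(mcc(model, [3.0, 2.0, 1.0]) < 0)  # True: reversed pattern
```

Dividing by the sum makes the score insensitive to a global change of contrast between the model and the sub-image, which is presumably why the plain cross correlation was modified.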
Two main ideas, coarse-to-fine graining and high-to-low level sub-images, are used to design a fast eyes and mouth searching algorithm. The eyes and brows do not need to be distinguished in the highest level sub-image, because the sub-image is too small. Since the eye is prone to be located erroneously near the brows, we use models of both eyes and brows to distinguish them at the next level (when the sub-image size is near 64 × 64). In the smallest sub-image, it is not difficult to search for candidates in the whole image using little computing time. At the other levels, the candidates are adjusted for better matching by slightly moving the center of each component. Therefore, the system is very fast.
The algorithm for searching for candidates for the eyes and mouth is listed below. Experimental results show no missed locations when 5 candidates are used, and over 80% of test images are correctly located by the first candidate.
Begin
  for level from the highest N down to the lowest 1
    Set the moving rectangle RECT of the left eye: the whole or the
      top-left quarter of the sub-image at the highest level N; at other
      levels, a small area around the position decided at the last level
    Read the sub-image f_LH + f_HL at this level
    for each position of the left eye in RECT
      for each distance between the left and right eyes (large range at
          level N, very small range at other levels)
        Produce all the feature points using the mean facial model
        Estimate the matching MCCs of the models of the left and right eyes
        for each slight shift of the mouth position in both x and y
          Estimate the matching MCC of the mouth
          Calculate the sum of the MCCs of all three components
            (eyes and mouth)
          Compare the sum with the stored sums and keep only the 5 largest
        endfor
      endfor
      if the distance between the left and right eyes is greater than that
          in the mean facial model at level 3 (sub-image size 64 × 64)
        for each slight change of the vertical distance between eyes and brows
          Distinguish the eyes and brows by moving the eyes vertically
            and maximizing the MCC of all brows and eyes
        endfor
      endif
    endfor
  endfor
End
Output: five candidates for the positions of the eyes and mouth
Since the search consists only of complete processing of the smallest sub-image and trivial adjustments at the other levels, the algorithm runs very fast. In fact, it needs less than 0.1 second for a 512 × 512 input image.
3. Face Recognition using SVM
The Support Vector Machine is an implementation of the structural risk minimization principle, developed by Vapnik et al. [10], whose objective is to minimize the upper bound on the generalization error.
In the case of a linearly separable two-class problem, with examples $\{(x_i, y_i)\}_{i=1}^{l}$, $x_i \in \mathbb{R}$, $y_i \in \{-1, +1\}$, the algorithm maximizes the margin of error, which is the perpendicular distance from the separating hyperplane $w$: $x \cdot w + b = 0$ to the nearest positive example plus the distance to the nearest negative example. We expect better generalization capability from a hyperplane with a maximized margin. In order to achieve this goal, $\|w\|^2$ is minimized subject to the set of constraints
$$y_i (x_i \cdot w + b) - 1 \ge 0, \quad i = 1, \ldots, l \qquad (3)$$
i.e. each example is classified with a distance from the separating hyperplane of at least 1.
This is done by reformulating the problem in terms of positive Lagrange multipliers $\alpha_i$: the following Lagrangian has to be minimized
$$L_P = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i \qquad (4)$$
subject to constraints (3). Minimization of $L_P$ (the primal problem) can be expressed as the maximization (the dual problem) of
$$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \qquad (5)$$
subject to the constraints $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$. In the solution only a small number of the $\alpha_i$ coefficients differ from zero. As each $\alpha_i$ corresponds to one data point, these are the only data included in the solution. These are the support vectors: points lying on the margin border.
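The step from the primal Lagrangian (4) to the dual (5) is the standard one; a short sketch follows, assuming the usual form of (4) in which the label $y_i$ multiplies the bracket inside the first sum.

```latex
\begin{align*}
\frac{\partial L_P}{\partial w} = 0
  &\;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i, \\
\frac{\partial L_P}{\partial b} = 0
  &\;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0, \\
L_P
  &= \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
   - \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
   - b \sum_{i} \alpha_i y_i + \sum_{i} \alpha_i \\
  &= \sum_{i} \alpha_i
   - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
   = L_D .
\end{align*}
```

The second condition is exactly the equality constraint of the dual, and the data enter the final expression only through the dot products $x_i \cdot x_j$.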
For problems that are not linearly separable, a mapping of the input space into a high-dimensional space, $x \in \mathbb{R}^l \mapsto \Phi(x) \in \mathbb{R}^h$, is needed, which gives a much higher probability that the mapped points will be linearly separable. The important element of the dual problem specification is that the data points do not appear by themselves, but only as dot products of pairs. This makes it possible, instead of explicitly choosing the feature space and the mapping, to use a kernel function $K(a, b)$: a positive symmetric function that is a scalar product in the feature space,
$$\Phi^T(x^{(1)}) \, \Phi(x^{(2)}) = \sum_i \Phi_i(x^{(1)}) \, \Phi_i(x^{(2)}) = K(x^{(1)}, x^{(2)}). \qquad (6)$$
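Equation (6) can be checked numerically for a small case. The sketch below uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which an explicit feature map is known; the choice of kernel, the map `phi`, and the sample points are illustrative assumptions.

```python
import math


def dot(u, v):
    """Plain dot product of two equally long sequences."""
    return sum(p * q for p, q in zip(u, v))


def phi(x):
    """Explicit feature map for K(a, b) = (a . b)^2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(a, b):
    """The same scalar product, computed without leaving the input space."""
    return dot(a, b) ** 2


a, b = (1.0, 2.0), (3.0, -1.0)
print(poly2_kernel(a, b))                                    # 1.0
print(math.isclose(dot(phi(a), phi(b)), poly2_kernel(a, b)))  # True
```

The kernel needs one dot product in two dimensions, while the explicit route needs the three-dimensional images of both points; the gap grows quickly with input dimension and polynomial degree.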
When the dot product in (5) is replaced with this kernel function, we obtain
$$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j). \qquad (7)$$
Gaussian RBF, polynomial, or hyperbolic tangent functions can be used as kernel functions. The solution to the classification problem is therefore given by the sign of the function
$$f(x) = \sum_{i=1}^{N_{SV}} \alpha_i y_i K(s_i, x) + b, \qquad (8)$$
where $s_i$ are the support vectors ($N_{SV}$ in total), i.e. the data points with nonzero Lagrange multipliers. To classify an example $x$, it is only necessary to compute the sign of $f(x)$.
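Evaluating the decision function (8) takes only the support vectors, their multipliers and labels, and the bias. The sketch below assumes a homogeneous polynomial kernel $(a \cdot b)^d$, so that degree 1 is the linear kernel; the toy support vectors and multipliers are made up for the example and satisfy $\sum_i \alpha_i y_i = 0$.

```python
def poly_kernel(a, b, degree=1):
    """Homogeneous polynomial kernel (a . b)^degree; degree 1 is linear."""
    return sum(p * q for p, q in zip(a, b)) ** degree


def decision(x, support_vectors, alphas, labels, bias, degree=1):
    """f(x) = sum_i alpha_i * y_i * K(s_i, x) + b, as in equation (8)."""
    return sum(alpha * y * poly_kernel(s, x, degree)
               for s, alpha, y in zip(support_vectors, alphas, labels)) + bias


def classify(x, support_vectors, alphas, labels, bias, degree=1):
    """Return +1 or -1 according to the sign of f(x)."""
    return 1 if decision(x, support_vectors, alphas, labels, bias, degree) >= 0 else -1


# Toy two-point problem: one positive and one negative support vector.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.25, 0.25]
labels = [1, -1]
print(decision((1.0, 1.0), support_vectors, alphas, labels, 0.0))   # 1.0
print(classify((-3.0, 1.0), support_vectors, alphas, labels, 0.0))  # -1
```

The cost of one classification grows with the number of support vectors, which is why a machine with fewer support vectors is faster at recognition time.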
3.1. Feature extraction
During the processing of the images, the features are extracted first. This is done with a set of specially trained SVMs which utilize feature location information provided by the wavelet statistics subsystem.
In order to train the SVMs, we selected 32 face images out of the database (we used the AT&T face database) and hand-marked the features. For each feature an SVM was trained. A bootstrapping method was employed: first an SVM was trained on a small set of examples, then other examples were checked, with those badly recognized being added to the training set. The SVM was retrained after every few hundred new examples were added. Training using all examples was prohibitively long. Actually, training first with the window skipping 7 pixels both horizontally and vertically, and then relearning with a skip of 4, proved to be quickest, with good generalization results too.
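To give a feeling for what the window skip means for the number of training examples, the sketch below enumerates window positions over a 92 × 112 image (the AT&T image size) for skips of 7 and 4. The 20 × 20 window size is a hypothetical value chosen for the illustration; the text does not state the window sizes used.

```python
def window_origins(width, height, win_w, win_h, skip):
    """Top-left corners of all windows that fit inside the image."""
    return [(x, y)
            for y in range(0, height - win_h + 1, skip)
            for x in range(0, width - win_w + 1, skip)]


coarse = window_origins(92, 112, 20, 20, skip=7)
fine = window_origins(92, 112, 20, 20, skip=4)
print(len(coarse), len(fine))  # 154 456
```

Under these assumptions the coarse pass sees roughly a third of the windows of the fine pass, which fits the reported observation that training first with the larger skip and then relearning with the smaller one was quickest.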
We used a conglomerate of 2 SVMs for each feature detector, with both machines trained on the same set but employing different kernel functions. This enabled us to enhance the generalization capabilities of the detector greatly. The second SVM was used only when the first could not detect any candidates for the feature (or detected fewer than a set minimum). This made it possible to use the SVMs with a smaller number of support vectors (i.e. the faster ones) first. Actually, the secondary machines were used rarely, in less than 10% of cases. Both polynomial (of degrees 2 and 3) and linear kernels were used. The generalization rates of the SVM feature extractors reached 95 to 97.5% (for different features and kernel functions).
The features are extracted one by one, first both of the eyes and the mouth, starting at locations predicted by the wavelet statistics system. To allow for a possible prediction error, the wavelet statistics system gives a ranked list of hints for each feature, which are checked in order, or merged if they lie close together. The use of the wavelet statistics subsystem speeds up the whole recognition.
Other feature positions are predicted, based on the features already found, with the use of a face geometry defined in fuzzy terms: we defined fuzzy functions for the expected distance between the eyes, between the eyes and the nose, and between the nose and the mouth, and for the inclusion of the nose in the triangle formed by the eyes and the mouth. These conditions are defined as trapezoidal functions.
3.2. Face-vector construction and recognition
After the extraction of candidates for each of the features, the face-vectors are constructed. This is done again with the help of the face geometry: only those feature conglomerates are selected which fulfill all the defined criteria, i.e. all membership functions have positive values. Then a number of those with the highest membership value (the minimum of all separate values, i.e. the fuzzy AND function) are selected; see Figure 3.
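The fuzzy face geometry can be sketched with a trapezoidal membership function per condition and the minimum as fuzzy AND. The breakpoints and measured values below are hypothetical; the text does not give the actual parameters of the trapezoids.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear between.

    Requires a < b <= c < d.
    """
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    if x <= c:
        return 1.0                # plateau
    return (d - x) / (d - c)      # falling edge


def fuzzy_and(memberships):
    """Fuzzy AND of several conditions: the minimum of their memberships."""
    return min(memberships)


# Hypothetical conditions, measured in units of the eye distance.
eye_nose = trapezoid(0.5, 0.0, 1.0, 2.0, 3.0)    # 0.5, on the rising edge
nose_mouth = trapezoid(1.5, 0.0, 1.0, 2.0, 3.0)  # 1.0, on the plateau
print(fuzzy_and([eye_nose, nose_mouth]))         # 0.5
```

A candidate combination is kept only if the result is positive, and candidates are then ranked by this value, as described above.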
As the SVM is a binary classifier, one SVM was built for each person in the database, detecting that person among the others. The 300 best face-vectors were used for each image during training. The use of a high number of examples makes up for small variations during feature extraction, e.g. an eye can be detected together with an eyebrow, or without.
Figure 3. Some faces and face-vectors extracted. In some, subtle differences among face vectors extracted from the same images can be seen.
During recognition the same path is followed: first the features have to be extracted, then the resulting face-vectors are recognized using the trained SVMs. Again, a number of face-vectors with the best fuzzy geometry membership values are used. If a face-vector is positively recognized by more than one machine, then the one with the highest activation (margin of error) is selected, and after all are tested, the person with the most detections is considered to be recognized in the image.
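The voting scheme described above can be sketched as follows. The activations are assumed to be the raw outputs $f(x)$ of each person's SVM; the function name, the tie handling (the first person to reach the top count wins), and the sample numbers are assumptions of the sketch.

```python
from collections import Counter


def recognize(activations):
    """Pick a person from per-face-vector SVM outputs.

    activations[k][p] is the output of person p's machine for
    face-vector k. Each face-vector votes for the machine with the
    highest positive output; the person with the most votes wins.
    Returns None when no machine fires on any face-vector.
    """
    votes = Counter()
    for row in activations:
        best = max(range(len(row)), key=lambda p: row[p])
        if row[best] > 0:
            votes[best] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


activations = [
    [0.2, -1.0, 0.9],    # two machines fire, person 2 has the higher output
    [-0.5, -0.3, -0.1],  # no machine fires, so no vote
    [1.2, -0.4, 0.3],    # vote for person 0
    [-0.2, 0.1, 0.8],    # vote for person 2
]
print(recognize(activations))  # 2
```

The `None` result corresponds to the "not recognized" outcome, in which none of the machines accepts any face-vector of the image.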
4. Experimental Results and Conclusion
In our experiments, we used the AT&T (formerly ORL) face database [8], which consists of 400 frontal images of 40 persons (10 images each). The images are 92 × 112 pixels with 256 grey levels. The feature SVM detectors were trained on 32 randomly selected images with hand-marked features. Then, the data set was divided into 2 equal parts: 200 images in the training set (5 images of each person), and the rest in the test set.
Our database used for training the wavelet statistics model comprises 400 Asian face images (200 male and 200 female). All the images are 512 × 512, frontal and upright, with the feature points manually marked for all five components (eyes, brows and mouth). The Daubechies wavelet is selected for the image wavelet decomposition. Then, the algorithm from section 2.5 is applied to the AT&T database to search for the candidates for the eyes and mouth. Of the 400 images, 325, 42, 21, 9 and 3 are correctly located at the first to fifth candidates respectively. In this experiment, our algorithm was applied to different people at various sizes and with changing poses. The algorithm is very fast (less than 0.1 second for each image) and gives good results.
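The candidate counts above translate directly into cumulative hit rates, which makes the earlier claims (over 80% at the first candidate, no misses within five) easy to verify. The helper name `cumulative_rates` is an assumption; the counts are those reported in the text.

```python
def cumulative_rates(counts):
    """Percentage of images located within the first k candidates, for each k."""
    total = sum(counts)
    rates, running = [], 0
    for count in counts:
        running += count
        rates.append(100.0 * running / total)
    return rates


# Images correctly located at the 1st, 2nd, ..., 5th candidate.
print(cumulative_rates([325, 42, 21, 9, 3]))
# [81.25, 91.75, 97.0, 99.25, 100.0]
```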
kernel                 left eye   right eye   nose bridge   mouth    nose
linear                 95.0%      91.8%       97.0%         95.4%    96.3%
polynomial, degree 2   95.6%      95.8%       97.3%         95.9%    95.7%
polynomial, degree 3   96.0%      96.0%       97.5%         96.1%    96.5%
Table 1. Feature extraction generalization rates for linear, polynomial with degree 2 and polynomial with degree 3 kernels.
The feature SVM extractors were trained with the bootstrapping method described before, using polynomial kernels of degrees 1, 2, and 3, which achieved generalization rates of 95 to 97.5%; see Table 1. As a second SVM was used in case the first could not find a feature (when, judging from the detection of other features, it should be there), and the SVM pairs were selected on the basis of minimal correlation of classifications, the actual generalization rates are even higher. Features were extracted at 5 different scales. The combination of SVM extractors and face geometry during extraction and face-vector construction resulted in perfect face vectors for all the images.
For the actual recognition 40 SVMs had to be trained, each to recognize one person against all others; this is due to the fact that the SVM is a two-class classifier. The results on the test set for a polynomial kernel of degree 2, a linear kernel, and a linear kernel with extended features are shown in Table 2. They are compared with the whole-image approach on the same data, whose results are lower and suffered greatly when the face in the image was shifted, which is not the case for the component-based approach. Using a complex SOM-convolutional network, Lawrence et al. achieved a 94.25% correct hit rate on the same data set [3]. "Extended" in Table 2 stands for a system where feature sizes were extended: the same features as before were extracted, but larger areas were included in the face-vector. In that mode, parts of the cheek were included too, without the need to extract them specially. This, and the addition of other features, will help in producing better recognition rates.
Different kernels were used, but the linear one (i.e. polynomial with degree 1) achieved the best generalization results, which hints that the face recognition problem with a component-based approach becomes linearly separable. The linear SVMs had the smallest number of support vectors, which is also important for speed.
The system is still slow, but the addition of the wavelet subsystem gives at least a twofold increase in speed, cutting down the time needed for face component extraction, as well as the time needed to build viable face vectors when many feature combinations fulfilled the face geometry constraints.
               correctly    badly        not
               recognized   recognized   recognized
polynomial     84.0%        3.0%         13.0%
linear         89.0%        1.5%         9.5%
extended       91.0%        2.5%         6.5%
whole image    83.5%        0.0%         16.5%
Table 2. Some recognition results on the test set. The linear kernel achieved better results than the polynomial one, and much better than the whole-image approach. "Not recognized" stands for images not recognized by any of the machines.
References
[1] R. Brunelli and T. Poggio. Face recognition: Features versus templates. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(10):1042-1052, October 1993.
[2] B. Heisele, M. Pontil, and T. Poggio. Component-based face detection in still grey images. Technical Report A.I. Memo 1687, MIT Artificial Intelligence Laboratory, 2000.
[3] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back. Face recognition: A convolutional neural network approach. IEEE Trans. on Neural Networks, 8:98-113, January 1997.
[4] S. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(7):674-693, July 1989.
[5] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, New York, 1998.
[6] H. Moon and P. J. Phillips. Analysis of PCA-based face recognition algorithms. In K. W. Bowyer and P. J. Phillips, editors, Empirical Evaluation Techniques in Computer Vision, pages 57-71. IEEE Computer Society Press, Los Alamitos, CA, 1998.
[7] P. J. Phillips. Support vector machines applied to face recognition. Technical Report NISTIR 6241, National Institute of Standards and Technology, Gaithersburg, 1998.
[8] F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of 2nd IEEE Workshop on Applications of Computer Vision, pages 138-142, Sarasota, FL, USA, 5-7 December 1994.
[9] A. Tefas, C. Kotropoulos, and I. Pitas. Using support vector machines to enhance the performance of elastic graph matching for frontal face authentication. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(7):735-746, July 2001.
[10] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.