Robust Face Descriptors in Uncontrolled Settings
Kenneth Alberto Funes Mora
LEAR Team
INRIA Rhône-Alpes
Supervisors
Cordelia Schmid, Jakob Verbeek and Matthieu Guillaumin
A Thesis Submitted for the Degree of
MSc Erasmus Mundus in Vision and Robotics (VIBOT)
2010
Abstract
Face recognition is known to be a difficult problem for the computer vision community.
Factors such as pose, expression, illumination conditions and occlusions, among others, span
a very large set of images that can be generated by a single person. Therefore the automatic
decision of whether a pair of images depicts the same person or not, in uncontrolled settings,
becomes a highly challenging problem.
Due to the large number of potential applications, many algorithms have been proposed over
the past years, which can be separated into three categories: holistic, facial feature based
and hybrid. Even though some algorithms have achieved a high accuracy, a significant
improvement is still needed to achieve robustness in uncontrolled conditions while maintaining
a high computational efficiency.
In this thesis we explore the use of a Histogram of Oriented Gradients as a holistic descriptor.
The experimental results show that a considerable performance is achieved when a proper set
of parameters is combined with a prior face alignment. The classification function is given by
a metric learning algorithm, i.e. an algorithm which finds the Mahalanobis distance that best
separates the input data.
Additionally, a facial feature based descriptor is presented, which is the concatenation of
SIFT descriptors computed at the locations of interest points found by a facial feature detection
algorithm. More importantly, a method to handle occlusions is proposed, where a confidence
is obtained for each facial feature and later combined into the classification function. Also,
nonlinear strategies for face recognition are discussed.
Finally, it is shown that there is complementary information between both descriptors, as
their combination improves the performance such that it becomes comparable to the current
state of the art algorithms.
Contents
Acknowledgments
1 Introduction
1.1 Problem definition
1.2 Outline and contributions
2 Related work
2.1 Marginalized k-Nearest Neighbors
2.2 Automatic naming of characters in TV video
2.3 Attribute and simile descriptor for face identification
2.4 Face recognition with learning based descriptor
2.5 Multiple one-shots using label class information
3 The face recognition pipeline
3.1 Face detection
3.2 Facial features localization
3.3 Face alignment
3.4 Preprocessing for illumination invariance
3.5 Face descriptor
3.6 Learning/Classification
3.7 Datasets and evaluation
3.8 Baseline performance
4 Histogram of Oriented Gradients for face recognition
4.1 Motivation
4.2 Alignment comparison
4.3 HoG parametric study
4.4 Discussion
5 Facial feature based representations
5.1 Motivation
5.2 Feature wise classification
5.3 Nonlinear approaches
6 Combining face representations
6.1 Results for LFW
6.2 Results for PubFig
7 Conclusions and future work
Bibliography
Acknowledgments
First of all I want to thank all the people who brought the Vibot program into existence and
who work very hard every year for its improvement: the coordinators Fabrice Meriadeau,
David Fofi, Joaquim Salvi, Jordi Freixenet, Robert Martí and Yvan Petillot, and every single
one of the lecturers and administrative staff. Without your effort and initiative we would not
be here.
To my supervisors: Cordelia Schmid, Jakob Verbeek and Matthieu Guillaumin. I am very
thankful to you for receiving me in the LEAR team, and for your valuable guidance, which helped
me to grow in knowledge and experience. Thanks as well to all the members of the LEAR team, for
their friendship and for making these months a very gratifying and enriching experience.
To all my Vibot colleagues: I have learned so many things from every single one of you, from
the cultures you represent, your different world views and your experience. It is one of the
things that I will never forget about the Vibot program. It helped me to grow in so many ways
that I cannot express. The world is a small place but it contains great people. Your friendship
will always be alive and I hope we will meet again in the future.
I want to thank my friends at home, who have been in contact with me all this time, always
willing to listen, always willing to advise, always willing to talk. Definitely a true friend is not
separated by distance. You guys know who you are...
I want to thank my family, my parents Carlos Funes and Ruth Mora, and my brother
Michael Funes, for their support from a distance and their encouraging words in moments of need.
Dad, Mom, Michael, I love you enormously! Thank you!
I would like to thank my God and saviour Jesus Christ: you are the source of my strength
and my motivation, you take me by the hand when I need it the most. Thank you...
Last but not least, to the European Commission for funding my studies during these two
years.
Chapter 1
Introduction
Face Recognition can be divided into two main applications: Face Identification and Face
Verification. The former refers to the association between a set of probe faces and a gallery, in
order to determine the identity of each of the exemplars from the probe set. The latter refers to
the decision of whether a pair of face instances correspond to the same person or not. This
definition is different from that of visual identification [13], where the term identification is
used for the pair matching problem. Notice that the face verification problem is more general,
in the sense that the face identification task can be formulated as solving a set of face
verification subproblems.
Within this thesis we focus on face verification. Therefore, the goal is to design an algorithm
that automatically decides whether a pair of unseen face images depicts the same person or not.
It is a supervised classification problem, in which the decision function is trained on a set
of example faces labeled with identities, or on pairs of face images labeled as similar or dissimilar.
A solution to this problem is highly attractive because of its many applications, in fields
such as entertainment, smart cards, information security, law enforcement, surveillance,
etc. [45]. Within the context of scene interpretation, we want to be able to automatically
determine what is happening in an image or a video [35]. Face recognition is highly
valuable as it helps to answer the question of who is in the scene [11,20]. This opens
the possibility for applications such as categorization, retrieval and indexing based on
identity [15,16]. The use of face recognition technology is becoming more and more visible, e.g. the
recent launch of tools for automatic face labeling in sites such as Picasa (Picasa Web Albums,
http://picasaweb.google.com/).
More than 35 years of research have generated many algorithms [1,3,7,11,15,16,22,26,35,
36,38,40,42,45] and benchmarks [19,22,27,33], which have pushed face recognition to achieve
outstanding results; a proof is the current availability of commercial software [28]. In general,
this software is designed for the case in which the person cooperates in the image acquisition
in a controlled environment, and therefore there are no major changes in illumination, pose,
expression, etc. However, face recognition in uncontrolled settings, from still images and videos,
is still an unsolved problem. Despite the large amount of research carried out, a significant
improvement is still required in order to achieve robustness in such settings.
Figure 1.1: Face variations due to: (a) Viewpoint changes (b) Illumination variation
(c) Occlusions (d) Expression (e) Age variations (f) Image quality
The main challenge is that a single person can generate a virtually infinite number of
images. This is due to the many factors that influence the image acquisition. Among the most
important are: major pose or viewpoint changes, including scaling differences; variations in the
illumination conditions; the possibility of occlusions due to sunglasses, hats and other objects;
differences in expression; aging; changes in hair and facial hair; and image quality. Figure 1.1
shows examples of how these factors affect the resulting image.
1.1 Problem deﬁnition
Even though many algorithms can be found in the literature, a general pipeline can be identified,
shown in Fig. 1.2. Its steps are intended to overcome the challenges previously mentioned. Face
detection is the first step: it defines a bounding box for the location and scale of the face. Then
three optional steps can be applied: alignment, facial feature localization and/or preprocessing
to gain invariance to illumination. The goal is to build a visual descriptor that can be used as
the input for machine learning algorithms. These algorithms are capable of classifying a pair
of examples as belonging to the same individual or not. Three categories of algorithms can be
identified: Holistic, Feature based and Hybrid approaches.
Figure 1.2: General face recognition pipeline. Blocks: face detection, alignment (optional),
facial features extraction (optional), illumination normalization (optional), visual feature
extraction, face identification.
Holistic face description methods consider the face image as a whole to build the descriptor.
Examples of such approaches are the subspace learning algorithms, where a face is represented
as a point in a high dimensional space, with the intensity of each pixel as one dimension,
followed by the use of techniques such as Principal Component Analysis (Eigenfaces) [38] or
Linear Discriminant Analysis (Fisherfaces) [3]. In such cases, the objective is to project the data
into a lower dimensional space where most of the information is maintained (PCA) or where the
discriminant information between different classes (people) is emphasized (LDA) when computing
the projection matrix. Bayesian methods also fall into this category, referring to those methods
that generate a Maximum a Posteriori (MAP) estimation of an intrapersonal/extrapersonal
classifier [24].
Additionally, proposals have been presented to unify Bayesian approaches with Eigenfaces
and Fisherfaces [40]. These algorithms have been shown to provide good results under controlled
conditions, using benchmarks such as the FERET database [33]. However, they are not suitable
for uncontrolled settings, where high nonlinearities are introduced, e.g. as a result of major
pose changes, and they are sensitive to the localization given by the face detector.
Proposals have been presented to improve the performance in uncontrolled conditions, by
creating more complex descriptors than simply the set of pixel values, e.g. using Local Binary
Patterns [1], or by extending subspace learning to handle nonlinear data using the kernel
trick [6,44]. Other proposals rely on methods specialized in nonlinear dimension reduction, by
learning an invariant mapping [17]. In this thesis, a holistic approach based on Histogram of
Oriented Gradients (HoG) will be presented in Chapter 4.
Feature based face description algorithms are grounded in the localization of a set of facial
features, such as the position of the mouth, the eyes, the nose, etc., after face detection [11,29].
A descriptor is built using the location information. In the past years, algorithms based on
facial feature localization have gained growing attention [7,10,11,16,22], as they are less
sensitive to pose variations and misalignments introduced by the face detector.
Therefore they are appropriate for face recognition tasks in uncontrolled settings. However,
the facial feature localization itself is still problematic, and needs further improvements.
In this thesis a feature-based algorithm using multiscale SIFT [16,23] will be presented, and
compared to the holistic approach based on HoG descriptors.
Hybrid face description methods combine holistic and feature based paradigms, through
either early or late fusion. Early fusion refers to the case in which descriptors are combined
into one using aggregation methods, such as concatenation of the feature vectors. In this case,
the information is combined prior to classification. Late fusion makes a classification based on
each descriptor, and the corresponding scores are combined into one, to make a more robust
decision. In this thesis we use a late fusion method, which combines the HoG and multiscale
SIFT descriptors.
1.2 Outline and contributions
In Chapter 2 different state of the art algorithms are described in detail. These were selected
either because they are the current state of the art for challenging benchmarks such as the
Labeled Faces in the Wild [19] dataset, or because they were an important influence for our work.
Chapter 3 gives a detailed description of the face recognition pipeline from Fig. 1.2. Each of the
stages is described, together with algorithms for its implementation.
The first contribution is given in Chapter 4, where we explore the use of a Histogram of
Oriented Gradients descriptor for face recognition. We show in this chapter that an alignment
that is robust to translations is necessary to obtain a good performance. Furthermore, we
identify a set of parameters for which the highest accuracy is achieved.
Our second contribution, described in Chapter 5, is related to feature based algorithms.
We propose a strategy in which learning is done for each facial feature, after which we combine
them through late fusion. Even though this does not help the overall performance, it is useful
for handling occlusions. This is done by detecting outliers based on a discriminative appearance
model. The occlusion information is later inserted into the classification function.
The third contribution is shown in Chapter 6, where we combine the HoG and multiscale
SIFT representations through late fusion. This combination increases the performance
of the algorithm such that it is comparable to the state of the art. Finally, in Chapter 7, we
give a summary of our work, pointing out the main conclusions, from which we define our future
work.
Chapter 2
Related work
In Chapter 1 different face recognition algorithms were mentioned. We identified a few methods
that have given promising results in uncontrolled settings, and are recognized as the state of
the art. These algorithms are described in more detail in this chapter.
2.1 Marginalized k-Nearest Neighbors
Guillaumin et al. proposed the use of metric learning approaches for face recognition [16], more
specifically, Logistic Discriminant Metric Learning (LDML), an algorithm that searches for
the best Mahalanobis distance between pairs of feature vectors, explained in more detail in
Section 3.6.2.
Even though LDML has proven to be effective, any metric learning algorithm of this kind
generates a linear transformation of the input space. However, data for face recognition is
believed to be highly nonlinear, due to major changes in pose and expression. Therefore, metric
learning approaches might not be able to effectively separate the classes. To overcome this problem,
Guillaumin et al. proposed a modification of k-Nearest Neighbors (kNN). In kNN classification,
an unseen example is assigned to the class that occurs most often among its k neighbors, which
are defined according to some measure, e.g. minimum Euclidean distance.
Let $n_c^i$ denote the number of neighbors of $x_i$ belonging to class $c$. Then the probability
of $x_i$ being of class $c$ is estimated as $p(y_i = c \mid x_i) = n_c^i / k$. The proposal is to
classify the pair $(x_i, x_j)$ as belonging to the same class by marginalizing over all the possible
classes within the training set. This is shown in Eq. (2.1).
$$p(y_i = y_j \mid x_i, x_j) = \sum_c p(y_i = c \mid x_i)\, p(y_j = c \mid x_j) = \frac{1}{k^2} \sum_c n_c^i\, n_c^j \qquad (2.1)$$
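As an illustration of Eq. (2.1), the following is a minimal sketch of the marginalized kNN score. It assumes plain Euclidean distance to define the neighborhood (the work in [16] uses a learned metric instead), and the function names are hypothetical.

```python
import numpy as np

def knn_class_counts(x, train_X, train_y, k, n_classes):
    """Count, for each class c, how many of the k nearest training points of x have label c."""
    dists = np.linalg.norm(train_X - x, axis=1)  # Euclidean distance (an assumption)
    nearest = np.argsort(dists)[:k]
    return np.bincount(train_y[nearest], minlength=n_classes)

def mknn_same_probability(x_i, x_j, train_X, train_y, k):
    """Estimate p(y_i = y_j | x_i, x_j) as (1 / k^2) * sum_c n_c^i * n_c^j."""
    n_classes = int(train_y.max()) + 1
    n_i = knn_class_counts(x_i, train_X, train_y, k, n_classes)
    n_j = knn_class_counts(x_j, train_X, train_y, k, n_classes)
    return float(n_i @ n_j) / k**2
```

The score is 1 when all neighbors of both points share one class and 0 when the two neighborhoods have no class in common.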
This result can be thought of as a binary k-Nearest Neighbors classifier in the implicit space
of $N^2$ pairs. This can be observed in Fig. 2.1, where for each point of the pair to be classified,
its k neighbors are selected and then the vote is given by the pairs that can be generated
from their neighbors, divided by the number of possible pairs (Eq. (2.1)).
The descriptors used in [16] were Local Binary Patterns (LBP) [42] and SIFT [23], computed
at 3 scales at the locations given by the facial feature localization algorithm, i.e. the corners of
the eyes, nose and mouth. The metric used to define the neighborhood was given by Large
Margin Nearest Neighbors [41], an algorithm designed to find a metric specifically optimized
for the kNN problem.
Figure 2.1: Marginalized k-Nearest Neighbors [16]. Labels in the illustration: classes A, B and C;
points x_i and x_j; pair counts 12, 6, 6 and 24.
2.2 Automatic naming of characters in TV video
Everingham et al. [11] considered the problem of automatic naming of characters in video. They
combined information such as subtitles and scripts to determine which characters are present
in the scene and when. Using visual information, they are able to associate a name to each
character for certain tracks. These tracks are also used to generate a set of training examples for
a face recognition algorithm, which determines the identity of characters in the remaining
unlabeled tracks.
In this case, the problem is simpler in terms of face recognition, since tracking can be used to
associate faces in a sequence of frames. Moreover, video can easily generate a large number of
training examples and, generally, there is a small number of characters to recognize.
The first step is to align the script (dialogue-character) with the subtitles (dialogue-timing)
to determine which characters are talking and when. Then they proceed to obtain face tracks,
which are face detections linked as the same person over a group of not necessarily sequential
frames. This is done using a Kanade-Lucas-Tomasi (KLT) tracker [34]; this algorithm uses an
interest point detector for the first frame and then propagates the points over the following
frames. Based on the tracked interest points, which follow a path intersecting face detections,
the face tracks are obtained as seen in Fig. 2.2a. The face tracking is done separately for each
shot of the video, where a change of shot is detected by thresholding the difference of color
histograms between successive frames. Notice that this simplifies the problem of face matching
and no real face recognition is done yet.
Figure 2.2: (a) Example of face tracking to build the training set (b) Feature patch extraction
[11]
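The shot change test mentioned above can be sketched as follows. This is only an illustration: the number of histogram bins, the L1 difference and the threshold value are assumptions, not the settings of [11], and the function names are hypothetical.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Concatenated per-channel histogram of an H x W x 3 uint8 frame, normalized to sum to 1."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def shot_boundaries(frames, threshold=0.5):
    """Return the indices t at which frame t is considered to start a new shot."""
    hists = [color_histogram(f) for f in frames]
    return [t for t in range(1, len(hists))
            if np.abs(hists[t] - hists[t - 1]).sum() > threshold]  # L1 histogram difference
```

Face tracks would then be built independently inside each interval delimited by the returned indices.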
In order to build a face descriptor, the facial feature detector described in detail in Section
3.2 is used. The pixel values surrounding each localization are extracted, as shown in Fig. 2.2b,
and normalized to have zero mean and unit variance, in order to acquire photometric invariance.
Using the localization of the mouth, speaker detection is performed simply by computing the
variation of the mouth pixels in sequential frames and thresholding. In addition to facial
information, clothing information is used, with a color histogram computed for a bounding box
below the face detection. Finally, knowing which face track is speaking and associating it with
the script and subtitle information, a set of face tracks can be properly labeled with an identity.
These tracks can be used as training examples for a classification problem, in order to label the
rest of the face tracks that could not be labeled in the previous steps.
To label the rest of the face tracks, a similarity measure comparing two characters combines
facial and clothing information, as given in Eq. (2.2):
$$S(p_i, p_j) = \exp\left(-\frac{d_f(p_i, p_j)}{2\sigma_f^2}\right) \exp\left(-\frac{d_c(p_i, p_j)}{2\sigma_c^2}\right) \qquad (2.2)$$
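A minimal sketch of Eq. (2.2) is given below, assuming that the face distance d_f and the clothing distance d_c have already been computed. The default values of the two scale parameters and the nearest-neighbor helper are illustrative assumptions.

```python
import numpy as np

def combined_similarity(d_face, d_cloth, sigma_face=1.0, sigma_cloth=1.0):
    """Similarity of Eq. (2.2): product of two exponentials of the scaled distances."""
    return np.exp(-d_face / (2 * sigma_face**2)) * np.exp(-d_cloth / (2 * sigma_cloth**2))

def label_track(track_dists, labels, sigma_face=1.0, sigma_cloth=1.0):
    """Nearest-neighbor labeling: return the label of the most similar labeled track.

    track_dists holds one (d_face, d_cloth) tuple per labeled track.
    """
    sims = [combined_similarity(df, dc, sigma_face, sigma_cloth) for df, dc in track_dists]
    return labels[int(np.argmax(sims))]
```

Because the two terms are multiplied, a large distance in either the face or the clothing descriptor is enough to drive the similarity towards zero.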
Taking into account this similarity measure, a classification based on Nearest Neighbors or
Support Vector Machines can be used to label the rest of the face tracks in the video. More details
can be found in [11].
Table 2.1: Low level feature parameters for a single trait classifier
Pixel Value Types      Normalization              Aggregation
RGB (r)                None (n)                   None (n)
HSV (h)                Mean-Normalization (m)     Histogram (h)
Image Intensity (i)    Energy-Normalization (e)   Statistics (s)
Edge Magnitude (m)
Edge Orientation (o)
2.3 Attribute and simile descriptor for face identiﬁcation
The work of Kumar et al. [22] reports one of the best results for the Labeled
Faces in the Wild benchmark when using the “restricted” protocol (explained in Section 3.7.1).
They presented two separate strategies: the attribute classifier and the simile classifier.
2.3.1 Attribute descriptor
The attribute classifier algorithm is based on the idea that a person’s identity can be inferred
from a set of high level attributes, such as gender, age, race, etc. The result is a descriptor
with one entry for each of the attributes, as shown in Fig. 2.3a. Each trait is determined
using the algorithm in [21]: the face image is divided into regions, as shown in Fig. 2.3c. The
aim is to have a set of low level features, each created by combining a region with
a specific pixel value type, normalization and aggregation. The options are listed in Table 2.1.
The selection of which combinations to use is trait dependent.
Kumar et al. proposed to use forward feature selection to determine which low-level features to
select for a given trait. Then an SVM classifier with RBF kernel is trained on the concatenation
of the useful low-level features. In [22], the low level descriptor is defined as
$F(I) = \langle f_1(I), f_2(I), \dots, f_k(I) \rangle$, where $f_i(I)$ represents feature $i$ of
image $I$, a selection from Table 2.1. The attribute descriptor is built using the outputs of the
trait classifiers as $x_i = \langle C_1(F(I_i)), C_2(F(I_i)), \dots, C_n(F(I_i)) \rangle$.
Finally, the recognition function is given in Eq. (2.3):
$$f(I_i, I_j) = D(x_i, x_j) \qquad (2.3)$$
with $D(x_i, x_j)$ a classification function, described in Section 2.3.3, such that the output
is positive for the same identity and negative for different identities.
Figure 2.3: (a) Descriptor based on high level attributes (b) Training examples for the attributes
(c) Face regions for the attribute classifiers [21]
2.3.2 Simile descriptor
A problem with the attribute classifier is that a significant amount of annotation must be
done, and only features that can be described with words, such as gender, can be used. Simile
descriptors are based on the intuition of describing a person based on similarities with reference
individuals. For example: “Nose similar to subject 1” and “Mouth not similar to subject 2”. To
create such a description, a set of reference face images was created. A classifier is trained based
on at least 600 positive examples for each feature and at least 10 times more negative examples.
The final descriptor is depicted in Fig. 2.4a, while Fig. 2.4b shows some training examples.
For a pair of unseen examples, their respective simile feature vectors, $x_i$ and $x_j$, are
computed. Then a classifier is used to decide whether they depict the same person
(Eq. (2.4)).
$$f(I_i, I_j) = D(x_i, x_j) \qquad (2.4)$$
2.3.3 Veriﬁcation classiﬁer
Both Eq. (2.3) and Eq. (2.4) use the same algorithm, which is a Support Vector Machine classifier
optimized to give higher importance to the sign than to the absolute value of the entries of the
descriptor. This is based on the observation that the trait classifiers are designed to produce
binary outputs, in the range [−1, 1].
Figure 2.4: (a) Descriptor based on similarity of features (b) Training examples for the features
To do that they proposed to generate pairs
$p_i = (|a_i - b_i|,\ a_i \cdot b_i)\, g(\tfrac{1}{2}(a_i + b_i))$, where $a_i = C_i(I_1)$,
$b_i = C_i(I_2)$ and $g(z)$ is a Gaussian weighting. The concatenation of all the pairs
generates the feature vector that is used by an SVM classifier with RBF kernel. Even though these
algorithms have both achieved outstanding results for Labeled Faces in the Wild, they do not follow
the strict evaluation protocol, as they use training data not available in the Labeled Faces in
the Wild dataset. They also have the disadvantage of using a large set of classifiers just to build
the descriptor, which is not desirable in terms of computational efficiency.
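The construction of the pair features fed to the verification classifier can be sketched as follows. Two points are assumptions here rather than facts taken from the text: the first entry of each pair is taken as the absolute difference of the two trait outputs, and g is modeled as a zero-mean Gaussian with a hypothetical width `sigma`.

```python
import numpy as np

def pair_features(a, b, sigma=1.0):
    """Build the concatenated pairs p_i from trait outputs a_i = C_i(I_1) and b_i = C_i(I_2)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Gaussian weighting g(z) evaluated at the mean of the two outputs (assumed zero-mean form)
    weight = np.exp(-((0.5 * (a + b)) ** 2) / (2 * sigma**2))
    pairs = np.stack([np.abs(a - b) * weight, a * b * weight], axis=1)
    return pairs.ravel()  # [p_1, p_2, ...], two entries per trait classifier
```

The product term is positive when both outputs have the same sign and negative otherwise, which is what gives the sign more importance than the magnitude.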
2.4 Face recognition with learning based descriptor
Recently, Cao et al. [7] introduced a novel method which is comparable to the best performing
algorithms for Labeled Faces in the Wild. It brings two main contributions: the first is
that there is no manually defined descriptor; instead, a proper encoding is learned specifically
for facial images, in an unsupervised manner. The second contribution consists of a pose dependent
classification.
As illustrated in the top part of Fig. 2.5b, the descriptor is learned as follows: a sampling
method is defined in which, for every pixel, its neighbors are retrieved in a predefined pattern
to form a low level vector. Examples can be observed in Fig. 2.5a, where different options for
patterns are presented. The sampling is done for every pixel in the image, for all the images in
the training set, and therefore each pixel has an associated low level feature vector.
A vector quantization algorithm is then used, which might be K-Means, PCA-tree or random
projection tree. Empirically, they found that the random projection tree gives better results. The
Figure 2.5: (a) Sampling patterns (1) to (4), built from rings R1 and R2 (b) Pipeline stages,
including landmark detection and component alignment
...
is used. They showed with their results that this also brings an improvement in the accuracy
of the classification.
2.5 Multiple one-shots using label class information
This method, introduced by Taigman et al. [36], is based on the one-shot similarity (OSS) score.
The OSS score is computed as follows: a set of face examples $A$ is obtained, which must not
share identities with the images to be compared. Then, if a pair of images $x_i$ and $x_j$
is to be classified, first a discriminative classifier $f_i$ is trained, using image $x_i$ as a single
positive example and the set $A$ as the negative examples. The process is repeated for $x_j$ to
obtain a classifier $f_j$. The OSS score is the average of the cross classifications, i.e.
$s = (f_i(x_j) + f_j(x_i))/2$.
The work of Taigman et al. [36] is an extension of this method which benefits from the use of
label information. The proposal is to split the set $A$ according to the identities, such that we
have $A_i$, $i \in \{1, 2, \dots, n\}$, and then to compute one OSS score from each of the subsets
to build a multiple one-shot vector. The motivation is to make classifiers which are more
discriminative towards identity than towards other factors, such as pose. If a subset $A_i$ has
images of only one person and there is variety regarding factors such as pose, expression, etc.,
then the classifier will be more likely to discriminate identity. If a factor such as pose is
constant within the subset $A_i$, then the OSS score will not be discriminative towards identity
but towards pose; however, they argue this information is beneficial when combining a large set of
OSS scores into the multiple one-shot vector. For this reason, they also created subsets of images
sharing the same pose to create more OSS scores.
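The one-shot similarity computation can be illustrated with the sketch below. The choice of discriminative classifier is an assumption: an LDA-style linear classifier is used, with the regularized covariance of the negative set standing in for the within-class scatter. The function names and the `reg` parameter are hypothetical.

```python
import numpy as np

def one_shot_classifier(x_pos, negatives, reg=1e-3):
    """Train a linear classifier with a single positive example and a set of negatives."""
    mu_neg = negatives.mean(axis=0)
    # Regularized covariance of the negatives, so that the linear system is well conditioned
    cov = np.cov(negatives, rowvar=False) + reg * np.eye(negatives.shape[1])
    w = np.linalg.solve(cov, x_pos - mu_neg)
    midpoint = 0.5 * (x_pos + mu_neg)
    return lambda x: float(w @ (x - midpoint))  # positive on the side of x_pos

def one_shot_similarity(x_i, x_j, negatives):
    """OSS score: average of f_i applied to x_j and f_j applied to x_i."""
    f_i = one_shot_classifier(x_i, negatives)
    f_j = one_shot_classifier(x_j, negatives)
    return 0.5 * (f_i(x_j) + f_j(x_i))

def multiple_one_shot_vector(x_i, x_j, negative_subsets):
    """One OSS score per subset A_i (split by identity or by pose), stacked into a vector."""
    return np.array([one_shot_similarity(x_i, x_j, A) for A in negative_subsets])
```

Every call trains two classifiers, so building the full vector requires two trainings per subset, which is where the cost of the approach comes from.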
The pipeline for this algorithm can be observed in Fig. 2.6, and it is described as follows: the
two images being compared are aligned, using a strategy similar to that of Section 3.3.2, and
a feature vector is created from each of them. They tested SIFT with dense sampling, Local Binary
Patterns (LBP), and the three-patch and four-patch LBP [42]. PCA is later used to reduce the
dimensionality of the descriptor. Then Information Theoretic Metric Learning (ITML) is used to
learn a Mahalanobis distance $d(x_i, x_j) = (x_i - x_j)^\top S (x_i - x_j)$, which generates a
distance above a certain threshold for negative pairs while maintaining the distance below another
threshold for positive pairs [9]. The learned matrix can be factorized using a Cholesky
decomposition as $S = G^\top G$, from which the matrix $G$ is used to project the feature vectors.
In the new space, the computation of the Euclidean distance is equivalent to computing the
Mahalanobis distance in the previous space. The metric and the PCA projection are obtained from
the training set prior to the computation of the OSS scores.
Figure 2.6: The multiple one-shot pipeline
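The equivalence between the Mahalanobis distance and the Euclidean distance after projection can be checked numerically with a small sketch. It assumes that the learned matrix S is symmetric positive definite; the function names are hypothetical.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, S):
    """d(x_i, x_j) = (x_i - x_j)^T S (x_i - x_j), as defined in the text."""
    d = x_i - x_j
    return float(d @ S @ d)

def projection_from_metric(S):
    """Return G such that S = G^T G.

    np.linalg.cholesky returns a lower-triangular L with S = L L^T, so G = L^T.
    """
    return np.linalg.cholesky(S).T

def projected_euclidean_sq(x_i, x_j, G):
    """Squared Euclidean distance between the projected vectors G x_i and G x_j."""
    d = G @ x_i - G @ x_j
    return float(d @ d)
```

Projecting once with G lets any later stage work with ordinary Euclidean geometry, without carrying S around.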
Finally, for a pair of face images to be classified, their feature vectors, projected using the
matrix $G$, are used to generate multiple OSS scores using the subsets $A_i$; these are
concatenated to create a vector which is fed into an SVM classifier.
This algorithm currently has the highest accuracy reported for the Labeled Faces in the
Wild benchmark, in the “unrestricted” protocol, explained in Section 3.7.1. However, notice that
the computation of OSS scores is very expensive, as many different discriminative models have
to be trained in order to create the multiple OSS score vector.
Chapter 3
The face recognition pipeline
In this chapter the pipeline depicted in Fig. 3.1 is discussed in more detail. The function of
each stage is described, and relevant algorithms for their implementation are presented.
3.1 Face detection
Face detection is the search for the location and scale of instances of human faces within an
arbitrary image. Again, the difficulty is to perform well in the presence of the factors that affect
images acquired in uncontrolled conditions (cf. Fig. 1.1). Viola & Jones [39] proposed an efficient
algorithm for face detection, based on Haar wavelet features and a cascade of classifiers,
selected by the Adaboost algorithm.
Adaboost [14] is an algorithm designed to create a “stronger classifier” from a set of “weak
classifiers” through their linear combination. The algorithm iteratively selects, from the space
of weak classifiers, the one which minimizes the error weighted by a distribution over the training
data. The weight assigned to the selected classifier depends on this error and, at each iteration,
the distribution is updated in such a way that the training examples which were misclassified are
given higher importance in the following iterations.
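The Adaboost loop described above can be sketched as follows. This is a generic discrete Adaboost with threshold stumps as weak classifiers, an illustrative choice that is not the Haar-feature cascade of [39]; the function names are hypothetical and labels are assumed to be in {-1, +1}.

```python
import numpy as np

def train_adaboost(X, y, n_rounds=10):
    """Discrete Adaboost over threshold stumps. Returns a list of (alpha, feature, threshold, sign)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)  # distribution over the training examples
    ensemble = []
    for _ in range(n_rounds):
        best = None
        # Exhaustive search for the stump with the lowest weighted error
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] >= thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
        err, j, thr, sign, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero for a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)  # classifier weight, larger for lower error
        ensemble.append((alpha, j, thr, sign))
        w = w * np.exp(-alpha * y * pred)      # misclassified examples gain weight
        w = w / w.sum()
    return ensemble

def predict_adaboost(ensemble, X):
    """Sign of the weighted linear combination of the selected stumps."""
    score = np.zeros(X.shape[0])
    for alpha, j, thr, sign in ensemble:
        score += alpha * sign * np.where(X[:, j] >= thr, 1, -1)
    return np.where(score >= 0, 1, -1)
```

On one-dimensional data labeled +, +, -, -, +, +, no single stump is correct everywhere, but the combination of three stumps classifies all six points correctly.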
Figure 3.1: General face recognition pipeline. Blocks: face detection, alignment (optional),
facial features extraction (optional), illumination normalization (optional), visual feature
extraction, face identification.
Figure 3.2: Viola-Jones object detection based on Haar features (a) Examples of Haar features
(b) Feature computation from the integral image. Notice that the area marked as D can be
computed using the points 1, 2, 3 and 4 from the integral image: D = 4 + 1 − (2 + 3) [39]
Their algorithm has the advantage of providing a fast way to compute the Haar wavelets by
precomputing what is called an integral image, Eq. (3.1). This is possible due to the rectangular
geometry of Haar wavelets (Fig. 3.2a), whose responses can then be obtained by adding a few
terms from the integral image (Fig. 3.2b). This is an important asset for detection, as different
Haar filters must be computed at many locations and scales within the probe image.

$$\tilde{I}(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y') \tag{3.1}$$
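As a minimal sketch of Eq. (3.1) and of the four-point lookup of Fig. 3.2b, the snippet below builds an integral image with cumulative sums and recovers the sum over an arbitrary rectangle. The function names and the inclusive `(top, left, bottom, right)` convention are assumptions made for illustration only.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] holds the sum of img[:y + 1, :x + 1], i.e. Eq. (3.1)."""
    return np.asarray(img, dtype=np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom + 1, left:right + 1] as D = 4 + 1 - (2 + 3)."""
    total = ii[bottom, right]              # point 4
    if top > 0:
        total -= ii[top - 1, right]        # point 2
    if left > 0:
        total -= ii[bottom, left - 1]      # point 3
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]     # point 1
    return total

def haar_two_rect(ii, top, left, height, width):
    """Two-rectangle Haar feature: left half minus right half (width must be even)."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, top + height - 1, left + half - 1)
    right_sum = rect_sum(ii, top, left + half, top + height - 1, left + width - 1)
    return left_sum - right_sum
```

Once the integral image is available, each rectangle costs at most four lookups regardless of its size, which is what makes evaluating many filters at many locations and scales affordable.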
While this algorithm is widely used because of its accuracy and speed, the implementation
used in this thesis is an extension of the Viola-Jones algorithm. Besides Haar features, Histogram
of Oriented Gradients (HoG) features (see Section 3.5.1) are used. The advantage of using
HoG features is that the same concept of the integral image can be applied, by creating the
Integral Histogram [32, 47]. This strategy keeps the algorithm fast, while the additional features
make it more robust. Fig. 3.3 shows some examples of face
detections.
3.2 Facial features localization
Facial feature point localization is the first step in feature based algorithms. Its robustness
is crucial for performance. The detector used in this thesis is the one from [11], which
is an improvement over the pictorial structure model [12]. The algorithm must maximize the
following measure:
$$p(F \mid p_1, \ldots, p_n) \propto p(p_1, \ldots, p_n \mid F) \prod_{i=1}^{n} \frac{p(a_i \mid F)}{p(a_i \mid \bar{F})} \tag{3.2}$$

Figure 3.3: (a) Correct face detections. (b) Example of a missed detection due to large pose
variation. (c) Incorrect detections due to a cluttered region.
Eq. (3.2) gives the probability of the set of features $F$ given a localization $(p_1, \ldots, p_n)$.
This is proportional to the probability of such a localization (whether the relative positioning
of the points is plausible according to the expected geometry), multiplied by the ratio between
the probability of observing the appearance $a_i$ for the respective feature and the probability of
observing that appearance when the feature is not present. The appearance model assumes
mutual independence between all the facial features, as well as independence from their
localization, which is why it appears as a product. Eq. (3.2) can be understood as
the combination of two models, one for the relative localization of the features and another for
their appearance.
The appearance ratios are modeled using a binary classifier, trained with feature/non-feature
examples. It uses AdaBoost to combine the weak classifiers given by Haar features. It follows
exactly the same algorithm as in Section 3.1, and the
output is substituted directly into Eq. (3.2). The localization, on the other hand, is modeled with
a tree-like Gaussian mixture in which the covariance dependencies take the form of a tree.
Each covariance depends on its parent node, as shown in Fig. 3.4, where nodes 2, 3 and 4 are
shown with an uncertainty relative to their parent node (1).
The combination of both models yields a highly reliable localization, which is able to cope
with large pose variations. It is also able to handle occlusions, as the expected positions
compensate for appearance problems.
As discussed in [12], the tree structure of the Gaussian Mixture Model allows for efficient
algorithms for maximizing Eq. (3.2), and using the Viola-Jones algorithm for appearance modeling
speeds up the algorithm as well.

Figure 3.4: Tree-like Gaussian Mixture Model for the localization of facial features
3.3 Face alignment
Many recognition algorithms rely on the ability of the face detector to give a standard location
and scale for the face. However, this is not always the case: standard face detectors such as
Viola-Jones's, and the one used for this project, give poorly aligned images. This is the
trade-off between the ability to detect faces with large changes in pose and expression and the
accuracy of alignment and localization. In order to compensate for those misalignments, different
algorithms have been proposed to bring an arbitrary facial image to a canonical pose, in which
facial features can be more easily compared. Recent algorithms have been proposed for non-rigid
transformations, such that the proper positioning of the facial features is inferred despite the
pose; see Zhu et al. [46]. In this section, two algorithms restricted to rigid transformations are
described.
3.3.1 Funneling
In 2007, Huang et al. [18] introduced a technique called unsupervised joint alignment. This
algorithm models an arbitrary set of images (in this case, face images) as a distribution field,
i.e. a model in which every pixel in the image is a random variable $X_i$ with possible values
from an alphabet $\chi$, for example the set of pixel intensities of an 8-bit grayscale image, i.e.
$\chi = \{0, 1, \ldots, 255\}$. Each pixel $X_i$ is then assigned a distribution over $\chi$.
The first step of the algorithm, which can be considered as training, is called congealing.
It computes the empirical distribution for each pixel based on the stack of images, i.e. the
empirical distribution field. Then, for each image, it applies a transformation (e.g. an affine
transformation) such that the entropy over the distribution field is minimized. It then
recomputes the empirical distribution field for the transformed images and repeats these
iterations until convergence.

Figure 3.5: Congealing example, showing distribution fields 1, 2, ..., n over the iterations [18]
Fig. 3.5 illustrates the idea of congealing. The distribution field is formed by a stack of 1D
binary images, i.e. $\chi = \{0, 1\}$. At each iteration, a horizontal translation is chosen for
each image in such a way that the overall entropy is reduced. As a result, at iteration n the
images are at positions in which they are considered aligned.
Notice that congealing can be used directly to align a set of face images. However, it
cannot be applied to an unseen example unless the new image is inserted into the training
set and congealing is run again. Funneling is an efficient way of doing that: the idea is to
keep the sequence of distribution fields obtained at each iteration of congealing, and to choose a
sequence of transformations for the new image based on the distribution field of each iteration.
In [18], instead of using pixel values, SIFT descriptors were computed at each pixel
location. Then k-means is used to obtain 12 clusters, which are used as the alphabet $\chi$.
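The 1D binary example of Fig. 3.5 can be reproduced with a toy sketch. The code below is a simplification under stated assumptions, not the implementation of [18]: translations are circular shifts, each image is updated greedily in turn (the distribution field is effectively recomputed after every single update), and the names `field_entropy` and `congeal` are hypothetical.

```python
import numpy as np

def field_entropy(stack):
    """Sum of per-pixel entropies (in bits) of a stack of binary images."""
    p = stack.mean(axis=0)                     # empirical P(X_i = 1) for each pixel
    p = np.clip(p, 1e-12, 1 - 1e-12)           # avoid log(0)
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)).sum())

def congeal(images, n_iters=10, shifts=(-1, 0, 1)):
    """Greedy congealing of 1D binary images using circular horizontal shifts."""
    stack = np.array(images, dtype=float)
    total_shift = np.zeros(len(stack), dtype=int)
    for _ in range(n_iters):
        changed = False
        for k in range(len(stack)):
            best_shift, best_h = 0, field_entropy(stack)
            for s in shifts:
                trial = stack.copy()
                trial[k] = np.roll(stack[k], s)
                h = field_entropy(trial)
                if h < best_h - 1e-9:          # accept only a strict entropy decrease
                    best_shift, best_h = s, h
            if best_shift != 0:
                stack[k] = np.roll(stack[k], best_shift)
                total_shift[k] += best_shift
                changed = True
        if not changed:                        # convergence: no image moved
            break
    return stack, total_shift
```

For funneling, one would additionally store the distribution field of every iteration so that a new image can be pushed through the same sequence; that part is omitted here.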
3.3.2 Facial features coordinates based alignment
Another strategy consists in using the output of the facial features localization, i.e. the
coordinates, to infer the affine transformation which brings the facial feature points to
a canonical pose, one that is shared among all the images.
Let $x^f = (x^f_0, x^f_1, 1)^\top$ be the homogeneous coordinates of feature $f$ in a non-aligned
image, and $y^f = (y^f_0, y^f_1)^\top$ the desired coordinates for the same feature. We want to obtain
the affine transformation $A$ ($2 \times 3$) such that $y^f = A x^f$. Only three features are needed
to obtain the six parameters of $A$; however, in order to compensate for wrong localizations, all
the features can be used to obtain the set of parameters which minimizes the least squares error
in localization.
Let $A'$ be the vector with the entries of $A$, and $Y$ the vector with the target coordinates,
$Y = (y^0_0, y^0_1, \ldots, y^{F-1}_0, y^{F-1}_1)^\top$. Finally, the matrix $X$, with the input
coordinates for all the features, is defined as shown in Eq. (3.3).
Figure 3.6: Examples of facial features based alignment, panels (a), (b) and (c)
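$$X = \begin{pmatrix}
x^0_0 & x^0_1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & x^0_0 & x^0_1 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x^{F-1}_0 & x^{F-1}_1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & x^{F-1}_0 & x^{F-1}_1 & 1
\end{pmatrix} \tag{3.3}$$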
Then, for the new variables, $y^f = A x^f$ becomes $Y = X A'$. Its least squares solution is given
in Eq. (3.4).

$$A' = (X^\top X)^{-1} X^\top Y \tag{3.4}$$
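The least squares estimate can be sketched in a few lines. The example below builds a matrix with the row layout described for $X$ and solves $Y = XA'$; it uses `numpy.linalg.lstsq` rather than forming $(X^\top X)^{-1}$ explicitly, which is a numerical design choice of this sketch and not necessarily what the thesis implementation does. The function name `fit_affine` is hypothetical.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine A such that dst_f ~ A @ (src_f[0], src_f[1], 1)."""
    src = np.asarray(src, dtype=float)   # shape (F, 2): detected feature coordinates
    dst = np.asarray(dst, dtype=float)   # shape (F, 2): canonical target coordinates
    n = src.shape[0]
    X = np.zeros((2 * n, 6))
    X[0::2, 0:2] = src                   # rows (x_0, x_1, 1, 0, 0, 0)
    X[0::2, 2] = 1.0
    X[1::2, 3:5] = src                   # rows (0, 0, 0, x_0, x_1, 1)
    X[1::2, 5] = 1.0
    Y = dst.reshape(-1)                  # (y_0^0, y_1^0, ..., y_0^{F-1}, y_1^{F-1})
    a_vec, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return a_vec.reshape(2, 3)           # first row maps to y_0, second row to y_1
```

With exactly three non-collinear features the system is determined; with more features the solution averages out localization errors, as described above.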
Figure 3.6 shows some examples of alignments obtained using this strategy. The disadvantage
of this approach is that facial feature localization algorithms are rather slow, and are affected
by large pose changes, which can lead to wrong alignments. Furthermore, a single canonical pose
is not suitable for major changes in viewpoint. Within this work, the target coordinates were
obtained by averaging over the set of training examples.
3.4 Preprocessing for illumination invariance
In uncontrolled conditions the illumination setup in which the image was acquired might have
a drastic influence on the obtained descriptor. A preprocessing stage can therefore optionally
be applied, in which the effect of illumination conditions, local shadowing and highlights is
removed, while preserving the visual information that is important for recognition.
Tan and Triggs [37] proposed an efficient pipeline to remove the effects of illumination,
specifically for face recognition. First, gamma correction is used, i.e. a transformation of the
pixel gray-level values $I$ using the nonlinear transform $\hat{I} = I^{\gamma}$, with $0 < \gamma < 1$.
This enhances the local dynamic range in dark regions while compressing it in bright
regions. Next, the image is convolved with a Difference of Gaussians (DoG) kernel, a bandpass
filter which is intended to remove gradients caused by shadows (low frequency) and to suppress
noise (high frequency), while maintaining the signal useful for recognition (middle frequency).
Additionally, a mask can be used to remove regions which are irrelevant for recognition. Finally,
contrast equalization is used to obtain a standardized contrast spectrum for the image. This is
done carefully by removing the effect of extreme values, such as artificial gradients introduced
by the masking. Fig. 3.7 shows examples of the resulting images after the preprocessing is
applied. In this thesis we did not consider a preprocessing step for illumination invariance, as
the descriptors used are based on gradients, and therefore are invariant to illumination shifts.

Figure 3.7: Preprocessing examples to gain illumination invariance (a) before preprocessing (b)
after preprocessing
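The first two stages of this pipeline (gamma correction followed by DoG filtering) can be sketched as below. This is an illustrative approximation only: the parameter values `gamma=0.2`, `sigma_inner=1.0` and `sigma_outer=2.0`, the edge padding, and the function names are assumptions of this sketch, and the masking and contrast equalization stages are omitted.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at three standard deviations."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with replicated borders; output has the input shape."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(img, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def gamma_dog(img, gamma=0.2, sigma_inner=1.0, sigma_outer=2.0):
    """Gamma correction I^gamma followed by a Difference of Gaussians band-pass."""
    img = np.asarray(img, dtype=float)
    img = img / max(img.max(), 1e-12)          # scale intensities to [0, 1]
    corrected = img ** gamma                    # 0 < gamma < 1 expands dark regions
    return gaussian_blur(corrected, sigma_inner) - gaussian_blur(corrected, sigma_outer)
```

A uniformly lit, textureless image yields a zero response, which reflects the band-pass behaviour: constant (lowest frequency) content is removed.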
3.5 Face descriptor
The objective is to transform an image into a feature vector $x_i \in \mathbb{R}^D$. This vector must
be discriminative, i.e. it must encode information that is relevant to determine the identity of
the person. The learning algorithms described in Section 3.6 provide strategies to learn which
information is relevant and which is not.
In Section 2.2 a facial feature based descriptor was presented, which consists of the pixel
intensities surrounding the localized facial features. These intensities are normalized to have
zero mean and unit variance to gain robustness to illumination changes. We refer to that
descriptor as a facial features patch. In this section, two more descriptors are described: a
Histogram of Oriented Gradients and SIFT.
3.5.1 Histogram of Oriented Gradients
The Histogram of Oriented Gradients (HoG) was initially proposed by Dalal and Triggs [8]. It is
a global (holistic) descriptor, closely related to SIFT (see Section 3.5.2) and to edge orientation
histograms, and was designed for the human detection task. The pipeline used for that application
is depicted in Fig. 3.8.
As illustrated in Fig. 3.8, the descriptor is built as follows. For an input image, the
derivatives in the x and y directions ($I_x$ and $I_y$) are computed by convolving the image with
the filters $h = [-1, 0, 1]$ and $h = [-1, 0, 1]^\top$ respectively. Then the magnitude and direction
of the gradient are obtained as $M(i,j) = \sqrt{I_x(i,j)^2 + I_y(i,j)^2}$ and
$\Omega(i,j) = \arctan(I_y(i,j)/I_x(i,j))$, so that each pixel has its gradient vector expressed as
magnitude and direction. Then, according to a predefined number of cells, the image is split into a
grid of cells × cells and, for each cell, a histogram is computed over the gradient angles of
the pixels contained in that cell. The vote of each pixel is given by its magnitude, and a soft
assignment is used, i.e. linear interpolation shares the vote among neighboring angle bins.
The next step is to normalize the histogram by using blocks of cells, i.e. groups of cells whose
joint energy is used for normalization. Dalal and Triggs used overlapping blocks, so that
there is redundancy over the cells being used: the same cell appears several times, differing only
in the value used for its normalization.
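The per-cell histogram computation described above can be sketched as follows. This is a simplified illustration, not the C/OpenCV implementation of this thesis: cells do not overlap, no normalization is applied, the derivative is a centred difference (its sign convention is an assumption), and `hog_cells` is a hypothetical name.

```python
import numpy as np

def hog_cells(img, cells=4, bins=8, signed=True):
    """Unnormalized HoG: one orientation histogram per cell of a cells x cells grid."""
    img = np.asarray(img, dtype=float)
    ix = np.zeros_like(img)
    iy = np.zeros_like(img)
    ix[:, 1:-1] = img[:, 2:] - img[:, :-2]      # [-1, 0, 1] filter along x
    iy[1:-1, :] = img[2:, :] - img[:-2, :]      # [-1, 0, 1]^T filter along y
    mag = np.hypot(ix, iy)                      # gradient magnitude M(i, j)
    period = 2 * np.pi if signed else np.pi     # [0, 360) or [0, 180) degrees
    ang = np.mod(np.arctan2(iy, ix), period)    # gradient direction
    # Soft assignment: bin centres sit at (k + 0.5) * period / bins.
    pos = ang / period * bins - 0.5
    lo = np.floor(pos).astype(int)
    frac = pos - lo
    h, w = img.shape
    rows = np.minimum(np.arange(h) * cells // h, cells - 1)
    cols = np.minimum(np.arange(w) * cells // w, cells - 1)
    r, c = np.meshgrid(rows, cols, indexing="ij")
    hist = np.zeros((cells, cells, bins))
    np.add.at(hist, (r, c, lo % bins), mag * (1 - frac))    # vote for the lower bin
    np.add.at(hist, (r, c, (lo + 1) % bins), mag * frac)    # vote for the upper bin
    return hist
```

Because the vote is split linearly between the two nearest bins, the total mass of the histograms equals the sum of gradient magnitudes, and small changes in angle produce small changes in the descriptor.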
For this thesis the strategy for normalization is different: we allowed the cells to overlap,
with the amount of overlap as a parameter, and we defined three types of normalization:
• Cell: The normalization value for each cell is computed using only the information within
the same cell. This approach is highly invariant to non-uniform illumination changes, but
the relative changes in gradient magnitude between different cells are lost.
• Global: All the cells are normalized with the same value, which is computed globally.
In this case, the relative changes in magnitude between different cells are maintained, but
there is poor illumination invariance.
• Block: The objective of block normalization is to provide a local, but coarser
normalization, so that it is a trade-off between illumination invariance and maintaining
changes in magnitude between different cells. The strategy depends on the overlap: to
comply with the geometry of the spatial grid, it can be used only for overlaps of 0% or 50%.
In the case of 0% overlap, the normalization value is computed by combining the energy of the
current cell (the one to be normalized) and 3 of its neighbors, as shown in Fig. 3.9a.
In the case of 50% overlap, the current cell is normalized using the neighbors on its diagonals.
Considering that a cell is actually 4 small squares in Fig. 3.9b, the diagonal neighbors
cover the area of the current cell.
Figure 3.8: Pipeline proposed by Dalal and Triggs for human detection using HoG [8]. Stages
shown: input image, normalize gamma & colour, compute gradients, weighted vote into spatial &
orientation cells, contrast normalize over overlapping spatial blocks, collect HOG's over
detection window, linear SVM, person / non-person classification.
In all cases the normalization used is L2, i.e. for a vector $x = (x_0, x_1, \ldots, x_{D-1})^\top$ the
normalized vector is obtained as $x' = x / \|x\|$, with:

$$\|x\| = \sqrt{\sum_{i=0}^{D-1} x_i^2} \tag{3.5}$$
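A minimal sketch of the cell and global variants of this L2 normalization is given below, assuming the histograms are stored as an array of shape (cells, cells, bins). The function name, the array layout and the small `eps` guard against division by zero are assumptions of this sketch; block normalization is omitted because it depends on the grid geometry of Fig. 3.9.

```python
import numpy as np

def l2_normalize(hist, mode="global", eps=1e-12):
    """L2-normalize HoG histograms stored as an array of shape (cells, cells, bins)."""
    hist = np.asarray(hist, dtype=float)
    if mode == "cell":
        # One norm per cell: relative magnitudes between cells are lost.
        norm = np.linalg.norm(hist, axis=-1, keepdims=True)
    elif mode == "global":
        # A single norm for the whole descriptor: relative magnitudes are kept.
        norm = np.linalg.norm(hist)
    else:
        raise ValueError("mode must be 'cell' or 'global'")
    return hist / (norm + eps)
```

Comparing both outputs on the same input makes the trade-off concrete: after cell normalization every cell has unit energy, while after global normalization a cell with weak gradients stays weak relative to the others.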
We also considered using a multiscale version of HoG. In this case, a HoG descriptor is
computed for each level of the scale pyramid. The number of cells for level $l$, denoted as $c_l$,
depends on the scaling factor $k$, i.e. $c_l = c_0 k^{-l}$. As a summary, the parameters involved in
the HoG descriptor computation are shown in Table 3.1.
Table 3.1: HoG parameters summary
Parameter       Description
--------------  ----------------------------------------------------
Cells           Number of cells for the image grid
Angles          Number of angle bins for each histogram
Overlap         Fraction of overlap between neighboring cells
Sign            Whether the angle range is 0-180° or 0-360°
Normalization   Either cell, global or block normalization
Levels          Number of levels for the multiscale HoG
Scaling (k)     Scaling factor for each level of the multiscale HoG
3.5.2 Scale invariant feature transform (SIFT)
The SIFT descriptor was proposed by Lowe [23], and it has proven to be very useful for object
recognition and matching applications. This descriptor is local in the sense that it describes the
region surrounding a keypoint, at a specific scale and orientation. Normally its location, scale
and orientation are obtained from an interest point (keypoint) detector. In the case of [23], the
keypoints are obtained as scale-space extrema using Difference of Gaussians (DoG) filtering.

Figure 3.9: HoG block normalization. (a) Zero percent overlap: the highlighted cell is
normalized using its energy plus the energy of its 3 immediate neighbors. (b) Fifty percent
overlap: the current cell is normalized using the energy of its 4 diagonal neighbors, which
cover its area due to the overlap.
The SIFT descriptor has the structure depicted in Fig. 3.10; the idea is similar to that of
the HoG descriptor. The gradient is computed for each pixel in the interest region, and the
area is divided into subregions (2x2 in Fig. 3.10), from each of which a histogram of gradients is
computed by using the magnitude of the gradient as the vote for the angle bins. However, it is
important to remark that, prior to the histogram computation, a Gaussian weighting is applied to
the magnitude, centered in the middle of the descriptor with σ equal to one half of the width of
the descriptor. This gives less importance to the pixels at the borders of the area and
therefore reduces the effect of misalignments. In this thesis, we used SIFT descriptors with 4x4
subregions, each with 8 angle bins, generating a 128-dimensional descriptor.
Figure 3.10: SIFT descriptor structure, showing the image gradients (left) and the resulting
keypoint descriptor (right) [23].
3.6 Learning/Classiﬁcation
Each image $i$ is represented by a descriptor vector $x_i \in \mathbb{R}^D$. The vector $x_i$ is also
associated with a categorical label $y_i$ corresponding to the person identity. A classification
algorithm for face recognition models the binary decision of whether images $x_i$ and $x_j$ belong to
the same class ($y_i = y_j$) or not ($y_i \neq y_j$), as shown in Eq. (3.6).

$$f(x_i, x_j): \mathbb{R}^{D \times 2} \to \{0, 1\} \tag{3.6}$$

In the following sections, relevant algorithms for classification are described.
3.6.1 Spectral regression kernel discriminant analysis
Kernel discriminant analysis (KDA) is an extension of linear discriminant analysis (LDA)
to handle nonlinear data. In the case of LDA, it is assumed that the data for each class follows
a normal distribution with equal covariance. The goal is to solve Eq. (3.7).

$$W_{opt} = \arg\max_{W} \operatorname{Tr}\{(W^\top S_W W)^{-1} (W^\top S_B W)\} \tag{3.7}$$

Eq. (3.7) finds the optimal combination of features which separates the input data according
to their classes. The objective function is such that the between-class covariance $S_B$ is
maximized and the within-class covariance $S_W$ is minimized. These terms are defined in Eq. (3.8)
and Eq. (3.9) respectively.
$$S_B = \sum_{i=1}^{c} N_i (\mu_i - \mu)(\mu_i - \mu)^\top \tag{3.8}$$

$$S_W = \sum_{i=1}^{c} \sum_{x_k \in X_i} (x_k - \mu_i)(x_k - \mu_i)^\top \tag{3.9}$$
Here $N_i$ and $\mu_i$ are the number of points and the mean of class $i$, $\mu$ is the mean of all
the data, independently of the class, and $X_i$ is the subset of points that belong to class $i$. LDA
can be described as an algorithm that finds an optimal linear projection such that the data
belonging to the same class are moved closer together, and the data belonging to different classes
are pushed apart.
In [2] it is shown that the problem can be reformulated in terms of inner products. Therefore
the kernel trick can be used to handle nonlinear data, which leads to the KDA algorithm. For
this thesis we used an instance of KDA called Spectral Regression Kernel Discriminant Analysis
(SRKDA), from the work of Cai et al. [6]. It is a specific formulation of KDA in which the
optimization process is theoretically 27 times faster. The limitation of SRKDA is that the
target space is limited to c − 1 dimensions, where c is the number of classes.
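For the linear case, the scatter matrices of Eq. (3.8) and Eq. (3.9) and the resulting projection can be sketched as below. This covers plain LDA only, not KDA or SRKDA. The trace criterion is solved here through the eigenvectors of $S_W^{-1} S_B$, a standard route; the small ridge added to $S_W$ and the function names are assumptions of this sketch.

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class (S_B) and within-class (S_W) scatter, Eq. (3.8) and Eq. (3.9)."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for label in np.unique(y):
        Xi = X[y == label]
        mu_i = Xi.mean(axis=0)
        diff = (mu_i - mu)[:, None]
        S_B += len(Xi) * (diff @ diff.T)
        centered = Xi - mu_i
        S_W += centered.T @ centered
    return S_B, S_W

def lda_projection(X, y, n_components, ridge=1e-8):
    """Columns of W: leading eigenvectors of S_W^{-1} S_B (at most c - 1 are useful)."""
    S_B, S_W = scatter_matrices(X, y)
    d = X.shape[1]
    # The ridge keeps the solve well defined when S_W is singular.
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W + ridge * np.eye(d), S_B))
    order = np.argsort(-evals.real)
    return evecs[:, order[:n_components]].real
```

Since $S_B$ is a sum of c rank-one terms constrained by the global mean, its rank is at most c − 1, which is the origin of the dimensionality limit mentioned above.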
3.6.2 Logistic regression
General logistic Regression
Logistic regression [4] models the probability that a feature vector $x_i$ belongs to a class as a
logistic sigmoid function whose argument is a linear combination of the entries of the feature
vector. This is shown in Eq. (3.10).

$$p(y_i = 1 \mid x_i) = \sigma(w^\top x_i), \tag{3.10}$$

where $\sigma(z) = (1 + \exp(-z))^{-1}$ is the sigmoid function, and $x_i$ is given in homogeneous
coordinates, which allows a bias term to be learned in $w$. Taking the negative log-likelihood
(Eq. (3.11)) and its gradient (Eq. (3.12)), the optimal weights can be obtained by running a
gradient descent algorithm until convergence (finding the minimum negative log-likelihood).
$$L = -\sum_{n} \left[ t_n \ln p_n + (1 - t_n) \ln(1 - p_n) \right] \tag{3.11}$$

$$\nabla L = \sum_{n} (p_n - t_n)\, x_n \tag{3.12}$$
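A minimal sketch of this procedure is shown below, assuming binary targets $t_n \in \{0, 1\}$ and $p_n = \sigma(w^\top x_n)$. For this parameterization the gradient of the negative log-likelihood with respect to $w$ is $\sum_n (p_n - t_n) x_n$, which is what the code uses. The learning rate, the iteration count, the averaging of the gradient over the examples, and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def to_homogeneous(X):
    """Append a constant 1 so that the bias is learned as the last entry of w."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def negative_log_likelihood(w, X, t):
    p = np.clip(sigmoid(to_homogeneous(X) @ w), 1e-12, 1 - 1e-12)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).sum())

def fit_logistic(X, t, lr=0.5, n_iters=2000):
    """Plain gradient descent on the negative log-likelihood."""
    Xh = to_homogeneous(X)
    w = np.zeros(Xh.shape[1])
    for _ in range(n_iters):
        p = sigmoid(Xh @ w)
        grad = Xh.T @ (p - t)          # sum_n (p_n - t_n) x_n
        w -= lr * grad / len(t)        # averaged gradient keeps the step size stable
    return w

def predict_proba(w, X):
    return sigmoid(to_homogeneous(X) @ w)
```

A fixed number of iterations is used instead of a convergence test to keep the sketch short; in practice one would stop when the decrease of the negative log-likelihood falls below a tolerance.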
Logistic discriminant metric learning
The objective of metric learning algorithms is to find the matrix $M \in \mathbb{R}^{D \times D}$ such that
the Mahalanobis distance, Eq. (3.13), is minimized for positive pairwise examples ($y_i = y_j$) and
maximized for negative pairwise examples ($y_i \neq y_j$).

$$d_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j), \tag{3.13}$$
where $M$ is restricted to be positive semidefinite (see footnote 1). Logistic Discriminant Metric
Learning (LDML), proposed by Guillaumin et al. [16], models the probability that two examples
depict the same person as given by Eq. (3.14).

$$p_n(y_i = y_j \mid x_i, x_j; M, b) = \sigma(b - d_M(x_i, x_j)), \tag{3.14}$$
where $\sigma(z) = (1 + \exp(-z))^{-1}$ is the sigmoid function and $b$ is a bias value. Let $n$ be an
index representing the pair $ij$. From Eq. (3.14), the likelihood over the seen data, taking $t_n$
as the target class for pair $x_n = (x_i, x_j)$, is given in Eq. (3.15).

$$L = \prod_{n=1}^{N} p_n^{t_n} (1 - p_n)^{1 - t_n} \tag{3.15}$$
From this it can be shown that the negative log-likelihood and its gradient are given by
Eq. (3.16) and Eq. (3.17) respectively.

$$L = -\sum_{n} \left[ t_n \ln p_n + (1 - t_n) \ln(1 - p_n) \right] \tag{3.16}$$

$$\nabla L = \sum_{n} (t_n - p_n)\, X_n \tag{3.17}$$
$X_n$ is defined as the vectorization of $(x_i - x_j)(x_i - x_j)^\top$. Using Eq. (3.16) and
Eq. (3.17) it is possible to learn the values of $M$ by minimizing the negative log-likelihood
using a gradient descent algorithm. If the matrix is restricted to be positive semidefinite, then a
Cholesky decomposition can be applied to it, i.e. $M = LL^\top$. In this case Eq. (3.13) can be
reformulated as in Eq. (3.18).

$$d_L(x_i, x_j) = (L^\top x_i - L^\top x_j)^\top (L^\top x_i - L^\top x_j) \tag{3.18}$$

Footnote 1: A matrix $M \in \mathbb{R}^{D \times D}$ is positive semidefinite if $x^\top M x \ge 0, \forall x \neq 0$.
It is denoted as $M \succeq 0$.
This result can be interpreted as a projection of the data followed by the computation of
the squared Euclidean distance in the new space. Throughout this thesis, logistic discriminant
metric learning is used as the main learning algorithm.
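The quantities involved in LDML can be sketched as follows. This is an illustrative sketch rather than the implementation of [16]: the gradient is kept as a D × D matrix instead of being vectorized, only the gradient with respect to $M$ is shown (the bias $b$ is treated as given), and all function names are hypothetical.

```python
import numpy as np

def mahalanobis(M, xi, xj):
    """d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)."""
    d = xi - xj
    return float(d @ M @ d)

def ldml_probability(M, b, xi, xj):
    """Probability that the pair depicts the same person: sigma(b - d_M)."""
    return 1.0 / (1.0 + np.exp(-(b - mahalanobis(M, xi, xj))))

def ldml_nll(M, b, pairs, targets):
    """Negative log-likelihood over labelled pairs (t = 1 for same identity)."""
    total = 0.0
    for (xi, xj), t in zip(pairs, targets):
        p = ldml_probability(M, b, xi, xj)
        total -= t * np.log(p) + (1 - t) * np.log(1 - p)
    return total

def ldml_gradient(M, b, pairs, targets):
    """Gradient of the negative log-likelihood w.r.t. M: sum_n (t_n - p_n) X_n."""
    grad = np.zeros_like(M)
    for (xi, xj), t in zip(pairs, targets):
        d = (xi - xj)[:, None]
        grad += (t - ldml_probability(M, b, xi, xj)) * (d @ d.T)
    return grad

def projected_distance(L, xi, xj):
    """Squared Euclidean distance after projecting with L^T, where M = L L^T."""
    diff = L.T @ xi - L.T @ xj
    return float(diff @ diff)
```

Using a rectangular $L$ with fewer columns than rows gives a low-rank $M$ and a projection to a lower-dimensional space, which reduces the number of parameters to learn.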
3.7 Datasets and evaluation
In order to evaluate the performance of our algorithm, two datasets are used: Labeled Faces in
the Wild (LFW) and Public Figures (PubFig). This section describes both datasets
together with their evaluation protocols.
3.7.1 Labeled faces in the wild
The main dataset used for this project is Labeled Faces in the Wild (LFW) [19]. It is an
important dataset due to its high variability in pose, expression, illumination conditions, etc.,
and is therefore considered appropriate to evaluate face recognition approaches for uncontrolled
settings [30]. It consists of 13233 images retrieved from Yahoo! News using a Viola-Jones face
detector. The images have a resolution of 250 × 250, and the scale and location of each face is
approximately the same, so there is no need to use a face detector. Each image is labeled
according to the person identity, giving a total of 5749 identities. The number of images per
person varies from 1 to 530.
To direct research efforts towards recognition algorithms rather than alignment, there
are three versions of LFW available:
• Not Aligned: the set of images as taken directly from the face detector.
• Aligned Funneled: aligned using the algorithm described in Section 3.3.1.
• Aligned Commercial: aligned using the algorithm introduced in [43].
In order to have a standard evaluation method to properly compare different algorithms,
a protocol was established. Ten independent subsets (folds) of images were defined, mutually
exclusive in terms of image exemplars and identities. The evaluation protocol allows for two
different paradigms: restricted and unrestricted. In the restricted case, a set of 600 pairs is
predefined for each of the ten folds; each pair has an associated label which indicates whether
or not the images belong to the same person, with 300 pairs for each case. In this paradigm the
identities must not be used, i.e. no more pairs can be created. In the unrestricted paradigm, the
identities can be used, so that a large number of pairs can be created.
For both cases, performance is reported as the mean over 10-fold cross validation. This
means that one of the 10 folds is held out and the training is done using the remaining subsets;
the accuracy is then obtained by classifying the "unseen" 600 pairs that were left aside. This
is done 10 times, rotating over the different folds, and the final report is the mean and standard
deviation of the accuracy over the 10 folds. In this work we focus on the unrestricted
paradigm.
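The rotation over folds can be sketched as follows. The callables `train_fn` and `predict_fn`, the representation of a fold as a `(pairs, labels)` tuple, and the use of the population standard deviation (`numpy.std` with its default `ddof=0`) are assumptions of this sketch, not details of the official protocol.

```python
import numpy as np

def cross_validate(folds, train_fn, predict_fn):
    """Hold out each fold in turn; return mean and standard deviation of the accuracy.

    folds      : list of (pairs, labels) tuples, one per fold
    train_fn   : callable taking the list of training folds and returning a model
    predict_fn : callable taking (model, pairs) and returning predicted labels
    """
    accuracies = []
    for k, (test_pairs, test_labels) in enumerate(folds):
        training_folds = [fold for i, fold in enumerate(folds) if i != k]
        model = train_fn(training_folds)              # never sees the held-out fold
        predicted = np.asarray(predict_fn(model, test_pairs))
        accuracies.append(np.mean(predicted == np.asarray(test_labels)))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```

As a usage example, `train_fn` could learn a distance threshold from the training pairs and `predict_fn` could label a pair as "same person" when its distance falls below that threshold.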
3.7.2 Public ﬁgures (PubFig)
The Public Figures dataset was compiled by Kumar et al. [22] and is larger than LFW. It
consists of 59470 images of 200 people, collected from the internet, so there are many
more images per person than in LFW. Similarly to LFW, it contains a large variability in pose,
illumination, expression, etc.
An important difference with LFW is that the images are given as a list of URL addresses from
different sources on the internet. This is a problem, since over time some images will
be lost. This was confirmed when we retrieved the dataset: 15% of the URLs were invalid and,
as a consequence, 25% of the test pairs could not be created.
The evaluation protocol is 10-fold cross validation using a "restricted" paradigm equivalent
to that of LFW, and therefore no additional pairs can be used to train the algorithm. Different
benchmarks are provided to measure the performance of the algorithm under specific conditions,
e.g. using only frontal pose images, or only neutral expressions, etc.
In our evaluation, we use the dataset under an "unrestricted" paradigm, defining our own pairs
for training, but using the benchmark test pairs for evaluation.
3.8 Baseline performance
Our baseline algorithm is the following: facial features are detected (see Section 3.2) and,
using the coordinates found, two feature vectors are built. The first vector is formed by the
concatenation of SIFT descriptors, obtained at three different scales (16, 32 and 48 pixels wide)
at the location of each facial feature (following [16]). The second is the concatenation of the
facial feature patches from Section 2.2. The implementation was done in Matlab, and computationally
expensive parts such as alignment and feature extraction were implemented in C.
Table 3.2 shows the results obtained for both descriptors on the Aligned Commercial version
of LFW. For comparison, two classifiers are used: the Euclidean distance between the feature
vectors of the pair of images being classified, and LDML, which learns a proper metric.
The results show the significant contribution of metric learning approaches to face
recognition. Additionally, when the Euclidean distance is used for classification, SIFT
descriptors bring no significant improvement over facial feature patches. The difference is only
observed when a proper metric is used.
Table 3.2: Baseline algorithms performance
Classification                          Facial Feature Patches   Multiscale SIFT
--------------------------------------  -----------------------  -----------------
Euclidean Distance                      0.6702 ± 0.0031          0.6845 ± 0.0051
Logistic Discriminant Metric Learning   0.7385 ± 0.0042          0.8524 ± 0.0052
Chapter 4
Histogram of Oriented Gradients for face recognition
4.1 Motivation
Facial feature based approaches have gained popularity in the past years due to their
robustness to pose variations in comparison with holistic approaches. However, the performance
of face recognition is strongly dependent on the accuracy of the facial feature
detection. Facial feature localization algorithms, even though they have improved significantly,
are still not able to cope with large pose variations. Besides, the computation time needed to
maximize the objective function, Eq. (3.2), over the set of possible locations is high.
For those reasons, it is desirable to have a pipeline without facial feature detection.
There is also the intuition that holistic approaches provide more information to the
learning process, which might give a higher discriminative power to the overall algorithm.
Therefore a Histogram of Oriented Gradients (HoG) descriptor, a holistic encoding, was
implemented following the description from Section 3.5.1. The implementation was written in C,
using the OpenCV library [5]. Assuming the input image resolution
is 250x250, the descriptor is computed over the 100x100 pixel region in the center of the image. It
is important to discard the background in order to reduce biases the dataset might have [31].
The objective is to find a set of parameters such that the discriminative power of the descriptor
is suitable for face recognition.
4.2 Alignment comparison
It is important to decide whether alignment is imperative for holistic approaches,
and more specifically for the use of a HoG descriptor. To answer this question, we compare the
three variants of LFW (Not Aligned, Aligned Funneled and Aligned Commercial) using the
same parameters for the HoG descriptor.
The first results are shown in Table 4.1. They reveal, in a consistent manner, that alignment
is crucial for face recognition using HoG. Interestingly, the funneled version of LFW
did not show any improvement over the not aligned version; in fact, accuracy drops. For that
reason, we ran a face alignment using the location of the facial features (cf. Section 3.3.2).
This boosts the results significantly for both the not aligned and the aligned
funneled versions, with an increase of over 5%, while there is an insignificant decrease in the
case of the aligned commercial version. Though it is not reported here, in our experiments we did
not observe a significant difference in accuracy between any of the LFW versions when using a
facial feature based descriptor.
These results lead to two conclusions. First, face alignment is indeed crucial for the use of HoG
descriptors. Second, as suggested by the decrease in accuracy of the funneled version with
respect to the not aligned version, the alignment should be robust not only in terms of rotation
and scale but, more importantly, in terms of translation. We believe that funneling is not as
robust to translation as a feature based alignment.
The need for an alignment that is robust to translation is intuitive, as it is desirable for
corresponding features to fall in the same spatial cell. Once the parametric study for
HoG was done (Section 4.3), the same experiment was performed using the best set of parameters
(Table 4.9). The results, shown in Table 4.2, confirm the previous behavior. A disadvantage
of this result is that, even if the descriptor is holistic, a facial feature
detector is needed prior to its computation, so the pipeline inherits the problems caused by the
detector.
Table 4.1: Alignment comparison for an initial set of parameters for a HoG descriptor: 12x12
cells, 16 angle bins, range [0-360]°, 50% overlap with block normalization
                          LFW variants
                          Not Aligned       Aligned Funneled   Aligned Commercial
------------------------  ----------------  -----------------  ------------------
No further alignment      0.7568 ± 0.0053   0.7408 ± 0.0067    0.8205 ± 0.0063
Feature based alignment   0.8069 ± 0.0066   0.8093 ± 0.0063    0.8171 ± 0.0047
Table 4.2: Alignment comparison for the final set of parameters: 16x16 cells, 16 angle bins,
range [0-360]°, 50% overlap with global normalization
                          LFW variants
                          Not Aligned       Aligned Funneled   Aligned Commercial
------------------------  ----------------  -----------------  ------------------
No further alignment      0.7660 ± 0.0061   0.7702 ± 0.0042    0.8432 ± 0.0062
Feature based alignment   0.8276 ± 0.0051   0.8383 ± 0.0054    0.8357 ± 0.0058
4.3 HoG parametric study
We perform a parametric study of Histogram of Oriented Gradients based face recognition.
The evaluation follows the protocol established for LFW, i.e. 10-fold cross
validation, and the results are reported as the mean and standard deviation of the accuracy
over the 10 folds. Unless specified otherwise, the dataset used is LFW Aligned Commercial and the
learning algorithm is LDML. As a search for the optimal parameters over all possible
combinations is almost intractable, we decided to optimize the parameters one by one.
4.3.1 Angle range
As a first experiment we studied the effect of the angle range on the performance of the
algorithm. To do that, we set the rest of the parameters to fixed values: 8 angle bins, as
used for the SIFT descriptor [23], 8x8 cells and 50% overlap, using block normalization. The
experiment was repeated for the three variants of LFW to compare the results.
It can be observed,fromTable 4.3,that a range of [0360]
◦
outperforms the range of [0180]
◦
,
when combined with LDML.This is consistent for the three variants of LFW.Therefore,in the
following experiments the default is a signed angle,i.e.a range of [0 −360]
◦
.
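The difference between the two angle ranges can be illustrated with a minimal sketch. The function name and the use of `numpy.gradient` as the derivative filter are assumptions made for illustration, not the implementation used in the experiments.

```python
import numpy as np


def gradient_orientation(image, signed=True):
    """Per-pixel gradient magnitude and orientation in degrees.

    signed=True  gives angles in [0, 360), keeping the gradient direction.
    signed=False gives angles in [0, 180), folding opposite directions.
    """
    # np.gradient returns derivatives along axis 0 (rows) and axis 1 (columns).
    gy, gx = np.gradient(np.asarray(image, dtype=float))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 360.0
    if not signed:
        angle = angle % 180.0
    return magnitude, angle
```

With the signed range, a dark-to-bright edge and a bright-to-dark edge vote for bins 180° apart. With the unsigned range they vote for the same bin, so the contrast polarity is discarded.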
Table 4.3: Angle range comparison for HoG. 8x8 cells, 8 angle bins, 50% overlap and block normalization.

               LFW variants
Angle range    Not Aligned        Aligned Funneled   Aligned Commercial
[0-180]°       0.7150 ± 0.0053    0.7077 ± 0.0052    0.7563 ± 0.0082
[0-360]°       0.7523 ± 0.0071    0.7495 ± 0.0054    0.8017 ± 0.0066

4.3.2 Normalization
The three variants of normalization are described in Section 3.5.1: cell, block and global normalization. Fig. 4.1 shows examples of HoG descriptors plotted over the original image. For cell normalization, as the norm is the same for each spatial bin, the relative changes in magnitude between different cells are lost, which diminishes the influence of strong gradients. However, it is very robust to non-uniform changes in illumination. In the case of global normalization, the important gradients, which appear in regions such as the eyes, mouth and nose, are emphasized at the cost of a weaker resistance to illumination changes. Block normalization is a tradeoff between the cell and global paradigms.

Figure 4.1: HoG normalization examples: (a) cell normalization, (b) block normalization and (c) global normalization.

Table 4.4: Normalization comparison for the HoG descriptor. Parameters: 16 angle bins, range [0-360]°.

                 Number of cells / Overlap (%)
Normalization    12/0               12/50              16/0               16/50
Cell             0.7933 ± 0.0061    0.8128 ± 0.0077    0.7578 ± 0.0091    0.8178 ± 0.0061
Block            0.8192 ± 0.0064    0.8305 ± 0.0064    0.8291 ± 0.0058    0.8385 ± 0.0074
Global           0.8247 ± 0.0071    0.8283 ± 0.0068    0.8317 ± 0.0056    0.8432 ± 0.0062
An experiment was performed in which all parameters were left unchanged except for the normalization type, the overlap and the number of cells. The results, found in Table 4.4, consistently show that cell normalization gives the worst performance. Global normalization leads to results similar to block normalization; in most cases global is better, the exception being 12 cells with 50% overlap. Because of these results, and for its simplicity of computation, we take global normalization as the default for further experiments. The exception is the quantity of cells experiment, which was computed in parallel.
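The three normalization schemes can be sketched as below. The exact definitions are those of Section 3.5.1; here we assume an L2 norm, a histogram array of shape (rows, columns, bins), and, for the block variant, non-overlapping 2x2 blocks. All of these are illustrative assumptions.

```python
import numpy as np

EPS = 1e-12  # avoids division by zero for empty cells


def normalize_cells(hist):
    """Each cell histogram is divided by its own L2 norm."""
    norms = np.linalg.norm(hist, axis=2, keepdims=True)
    return hist / (norms + EPS)


def normalize_blocks(hist, block=2):
    """Each (assumed non-overlapping) block of cells shares one L2 norm."""
    out = np.array(hist, dtype=float)
    n_rows, n_cols, _ = out.shape
    for r in range(0, n_rows, block):
        for c in range(0, n_cols, block):
            patch = out[r:r + block, c:c + block]  # view into `out`
            patch /= np.linalg.norm(patch) + EPS
    return out


def normalize_global(hist):
    """The whole descriptor is divided by a single L2 norm."""
    return hist / (np.linalg.norm(hist) + EPS)
```

Multiplying the gradients of a single cell by a constant leaves the cell-normalized descriptor unchanged, which illustrates both the robustness to local illumination changes and the loss of relative magnitude between cells. Under global normalization the same change alters the whole descriptor.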
4.3.3 Quantity of cells
Another important parameter to determine is the quantity of cells. Table 4.5 shows the experiments we performed changing only this parameter. Here we used 16 angle bins over a signed range, i.e. [0-360]°, using 0% overlap and block normalization. It can be observed that above 14 cells there is no significant variation, while below that value the results start to degrade. A possible reason why there is no improvement above 14 cells is that LDML starts to combine the information of finer cells as if they were coarser. More cells do not bring any improvement and only generate larger descriptors: e.g. there is no significant difference in performance between 16 × 16 and 20 × 20, yet a 20 × 20 grid has 400 cells against 256 for 16 × 16, i.e. a descriptor roughly 1.6 times larger. Therefore, we decided to set 16 cells as our default value.
Table 4.5: Number of cells comparison for the HoG descriptor. 16 angle bins, range [0-360]°, 0% overlap with block normalization.

Number of cells    Accuracy
10                 0.8198 ± 0.0086
12                 0.8305 ± 0.0064
14                 0.8327 ± 0.0080
16                 0.8385 ± 0.0074
18                 0.8348 ± 0.0059
20                 0.8412 ± 0.0060
22                 0.8380 ± 0.0068
4.3.4 Angle bins
Angle bins refer to the number of partitions into which the angle range is split. Experiments were done to compare how the performance is affected by the number of angle bins per cell. The results can be found in Table 4.6. The maximum is found at 16 bins, which is therefore taken as the default for further experiments.
Table 4.6: Accuracy obtained using different angle bins for the HoG descriptor. Parameters: 16x16 cells, range [0-360]°, 0% overlap with global normalization.

Angle bins    Accuracy
8             0.8230 ± 0.0049
12            0.8270 ± 0.0052
16            0.8317 ± 0.0046
20            0.8295 ± 0.0077
4.3.5 Overlap
Table 4.7 shows the variation in accuracy as a function of the overlap when the rest of the parameters are left unchanged. The maximum accuracy, 0.8432 ± 0.0062, was obtained for an overlap of 50%. However, the accuracy is not highly affected for a range between 10% and 60%.
It is important to remark that the cell size in pixels is a function of the overlap when the image size remains fixed. Therefore, to show that overlap is beneficial, an additional experiment was done: a descriptor with 9x9 cells and no overlap was created. In this case the cell size is similar to that of 16x16 cells using 50% overlap (≈ 11 pixels). The accuracy obtained was 0.8207 ± 0.0080, which is lower than with overlap. We argue that overlap is beneficial as it helps to correct misalignments due to problems in face detection or pose variations.
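The relation between overlap and cell size can be made explicit with a small sketch. It assumes that the cells tile the image exactly along each axis, so that n cells of side s with overlap fraction o cover s (1 + (n - 1)(1 - o)) pixels. The 96-pixel image side used in the usage note is a hypothetical value chosen only to reproduce a cell size of about 11 pixels.

```python
def cell_size(image_size, n_cells, overlap):
    """Cell side length in pixels for `n_cells` per axis covering `image_size`.

    `overlap` is the fraction of a cell shared with its neighbor (0.5 = 50%).
    Assumes the cells tile the image exactly, without border padding.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    effective_cells = 1.0 + (n_cells - 1) * (1.0 - overlap)
    return image_size / effective_cells
```

Under this assumption, 16 cells with 50% overlap span the image like 8.5 non-overlapping cells, so 9 cells without overlap is the closest integer grid with a comparable cell size. For a hypothetical 96-pixel image side, `cell_size(96, 16, 0.5)` is about 11.3 pixels and `cell_size(96, 9, 0.0)` about 10.7 pixels.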
Table 4.7: Overlap comparison. Parameters: 16x16 cells, 16 angle bins in the range [0-360]°, using global normalization.

Overlap (%)    Accuracy
0              0.8317 ± 0.0056
12.5           0.8423 ± 0.0045
25             0.8392 ± 0.0054
37.5           0.8412 ± 0.0064
50             0.8432 ± 0.0062
62.5           0.8415 ± 0.0066
75             0.8333 ± 0.0052
4.3.6 Multiscale HoG
We also studied a multiscale HoG descriptor. In this case there are two parameters involved: the number of scales and the rescaling factor. The results in Table 4.8 show that a multiscale approach does not bring any significant contribution to the performance. The reason might be that a coarser level of the pyramid is only a linear combination of the finer cells. This causes LDML to ignore the coarser levels, as the information of the finest level of the pyramid is enough.
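The linear-combination argument can be checked numerically in an idealized setting. The sketch below assumes unnormalized histograms and coarse cells that coincide exactly with 2x2 groups of fine cells, which is a simplification of the rescaling factors in Table 4.8. It shows that any Mahalanobis metric on the concatenated fine and coarse descriptor has an equivalent metric on the fine level alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 3
fine = rng.random((4, 4, n_bins))  # 4x4 grid of fine cell histograms


def pool(v):
    """Sum each 2x2 group of fine cells into one coarse cell (flattened)."""
    return v.reshape(2, 2, 2, 2, n_bins).sum(axis=(1, 3)).ravel()


coarse = pool(fine)

# Pooling is linear: build the matrix A with coarse = A @ fine.
D = fine.size
A = np.array([pool(e) for e in np.eye(D)]).T   # shape (12, 48)
B = np.vstack([np.eye(D), A])                  # fine -> [fine; coarse]

# Arbitrary positive semidefinite metric on the multiscale descriptor.
L = rng.random((5, B.shape[0]))
M = L.T @ L
M_fine = B.T @ M @ B                           # equivalent metric on fine level

x, y = rng.random(D), rng.random(D)
d_multi = (B @ (x - y)) @ M @ (B @ (x - y))
d_fine = (x - y) @ M_fine @ (x - y)
```

Here `d_multi` and `d_fine` agree up to rounding, so in this idealized case the coarse level adds nothing that a learned metric on the finest level could not already express.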
Table 4.8: Multiscale HoG performance.

            Number of cells
Levels/k    12                 14                 16                 18
2/1.15      0.8317 ± 0.0067    0.8407 ± 0.0065    0.8435 ± 0.0074    0.8425 ± 0.0074
2/1.30      0.8287 ± 0.0067    0.8375 ± 0.0059    0.8380 ± 0.0047    0.8453 ± 0.0068
2/1.45      0.8355 ± 0.0056    0.8388 ± 0.0068    0.8413 ± 0.0063    0.8410 ± 0.0074
3/1.15      0.8322 ± 0.0074    0.8423 ± 0.0061    0.8398 ± 0.0062    0.8397 ± 0.0063
3/1.30      0.8312 ± 0.0057    0.8383 ± 0.0065    0.8397 ± 0.0062    0.8435 ± 0.0070
3/1.45      0.8312 ± 0.0059    0.8360 ± 0.0073    0.8440 ± 0.0057    0.8420 ± 0.0058
4.4 Discussion
The outcome of this study is the identification of appropriate parameters for face recognition. The descriptor to be used has a spatial grid of 16x16 cells with an overlap of 50%; the angle histograms are created using 16 bins covering the range from 0° to 360°, and the voting is done using soft assignment by linear interpolation. There is no need for a multiscale descriptor when using LDML as the classification algorithm.
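Soft assignment by linear interpolation can be sketched for the orientation dimension as follows. The function name, the placement of bin centers at (k + 0.5) times the bin width, and the circular wrap-around are assumptions for illustration; any spatial interpolation between cells is left out.

```python
import numpy as np


def soft_vote(angles_deg, magnitudes, n_bins=16, angle_range=360.0):
    """Orientation histogram where each gradient votes for its two nearest bins.

    The vote is split linearly according to the distance to the bin centers,
    and the histogram wraps around at the end of the angle range.
    """
    angles = np.asarray(angles_deg, dtype=float)
    weights = np.asarray(magnitudes, dtype=float)
    bin_width = angle_range / n_bins
    pos = angles / bin_width - 0.5          # continuous bin coordinate
    lower = np.floor(pos).astype(int)
    w_upper = pos - lower                   # share given to the upper bin
    hist = np.zeros(n_bins)
    np.add.at(hist, lower % n_bins, (1.0 - w_upper) * weights)
    np.add.at(hist, (lower + 1) % n_bins, w_upper * weights)
    return hist
```

A gradient lying exactly on a bin center gives its whole magnitude to that bin, while one lying on the boundary between two bins is split evenly. This avoids abrupt changes in the descriptor when an orientation crosses a bin boundary.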
Further improvements could be achieved by reducing the large differences in occurrence between certain angle bins. For example, regions around the mouth are expected to always have a high occurrence of horizontal lines. Therefore a large fraction of the feature vector energy is concentrated in the angle bins corresponding to those gradients, overshadowing other bins with lower occurrence. This problem is one of the main motivations for the work of Cao et al. [7], as this concentration of energy reduces the discriminative power of the descriptor.
A simple way to balance the energy is to define new descriptors $x'$ by computing the element-wise square root of the input descriptors, i.e. $x' = (\sqrt{x_0}, \sqrt{x_1}, \ldots, \sqrt{x_{D-1}})^\top$. This is similar to the computation of the Hellinger distance $d(x,y) = \sum_i (\sqrt{x_i} - \sqrt{y_i})^2$, but extended to handle inter-feature correlation through the Mahalanobis distance. This modification raised the accuracy from 0.8432 ± 0.0062 to 0.8530 ± 0.0065 for the aligned commercial version of LFW. Notice that with this method the conclusion drawn for SIFT multiscale might not hold, as coarser cells would no longer be a linear combination of finer cells.
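The link between the square root mapping and the Hellinger distance, as written above, can be verified with a short sketch. The function names are illustrative, and the identity matrix is used as a stand-in for the metric that LDML would learn.

```python
import numpy as np


def sqrt_map(x):
    """Element-wise square root of a non-negative descriptor."""
    return np.sqrt(np.asarray(x, dtype=float))


def hellinger(x, y):
    """d(x, y) = sum_i (sqrt(x_i) - sqrt(y_i))^2, as written in the text."""
    return float(np.sum((sqrt_map(x) - sqrt_map(y)) ** 2))


def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ M @ diff)
```

With `M` set to the identity, `mahalanobis_sq(sqrt_map(x), sqrt_map(y), M)` equals `hellinger(x, y)`; a learned, non-diagonal `M` additionally accounts for correlations between features. The mapping also compresses dominant bins: two entries in a ratio of 81 to 1 end up in a ratio of 9 to 1.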
This result suggests that it would be interesting to study different strategies to distribute the energy of the descriptor. For example, instead of computing the square root, a parameter $\gamma \in [0,1]$ could be used to create a new feature vector $x' = (x_0^\gamma, x_1^\gamma, \ldots, x_{D-1}^\gamma)^\top$. This is a generalization of the square root vector, which corresponds to $\gamma = 0.5$.
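A minimal sketch of this generalization is given below. The function name is hypothetical and no value of γ other than 0.5 was evaluated in this chapter.

```python
import numpy as np


def power_map(x, gamma):
    """Element-wise power x_i ** gamma of a non-negative descriptor."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma is expected in [0, 1]")
    return np.power(np.asarray(x, dtype=float), gamma)
```

Setting γ = 1 leaves the descriptor unchanged and γ = 0.5 gives the square root vector. Smaller values of γ flatten the energy distribution further, so γ could be selected by cross validation.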
Table 4.9:Best found parameters for HoG based recognition
Parameter Description
Cells 16
Angle bins 16