# Face Recognition using Multivariate Statistical Techniques


Nov 17, 2013


Introduction:

Face recognition by digital computers has received a lot of attention in recent years. There are many practical applications of such a technique: face recognition can be used in criminal identification, in security systems, and to improve the interaction between humans and computers. Face images are complex, and on top of this complexity they come with a variety of lighting effects, backgrounds, etc. Including all these variations in a model is difficult, which makes face recognition a challenging task. The methods followed here are simple: they find a small set of faces that explain most of the variance in the data set and then apply different classification rules based on these few faces. The methods adopted for classification are also compared.

The faces analyzed are obtained from the ORL database of faces at http://www.uk.research.att.com/facedatabase.html. The website provides the following description. The database contains a set of face images taken between April 1992 and April 1994 at the AT&T laboratories. The database was originally used in the context of a face recognition project carried out in collaboration with the Speech, Vision and Robotics Group of the Cambridge University Engineering Department. There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

This makes a total of 400 face images, with 10 images corresponding to each person. The resolution of each image is 112x92 pixels. All the images are grayscale and stored in the .pgm image format; the data in each image consist of pixel intensity values ranging from 0 to 255, with 0 corresponding to black and 255 to white. Some images from the data set are shown below:

Initial steps:

The data are first unpacked from the .pgm image format as follows: for any image, the pixel intensity values are read starting from pixel position (1,1) and moving along the first row, column by column, until the row ends; the pixel values of the next row are then read in the same way, and finally all the values are concatenated into a single row vector. All the images are unpacked in this fashion, which gives a data matrix X of 400x10304 dimensions (i.e. 400 images, each having 112x92 = 10304 pixels). Hence, each image can be considered a vector in a 10304-dimensional space.
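The unpacking step can be sketched as follows. This is a minimal illustration, assuming NumPy and using random arrays as a stand-in for the 400 ORL images; the function name `unpack_images` is hypothetical and reading the actual .pgm files is left out.

```python
import numpy as np

def unpack_images(images):
    """Stack 2-D grayscale images into a data matrix, one flattened image per row."""
    # order="C" reads each image row by row, left to right within a row
    return np.stack([np.asarray(img, dtype=np.float64).ravel(order="C")
                     for img in images])

# Synthetic stand-in for the 400 images (112 rows x 92 columns each)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(400, 112, 92))

X = unpack_images(fake_images)
print(X.shape)  # (400, 10304)
```

Each row of `X` is then one image and each column one pixel, which is the layout assumed in the rest of the analysis.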

For statistical analysis, the pixels (columns of the data matrix) can be considered random variables and the different images (rows of the data matrix) samples of these variables. Most of the variables are expected to be very highly correlated. This is because most of the pixels that form the background have the same intensity within an image, so they are highly correlated. Similarly, many of the pixels that make up the hair are also correlated. So, there is a lot of redundancy in the data.

Basic Idea:

The objective is to correctly identify the individual given an image of his/her face. To this end, one of the classification procedures listed below is followed.

1) Linear discriminant analysis (LDA), assuming multivariate normality of features given groups and a common covariance.

2) Fisher's discriminant procedure, which does not assume multivariate normality of features given groups but assumes a common covariance.

A brief description of each of the methods mentioned above follows:

Method 1: The method, based on the assumptions, obtains the posterior probabilities in terms of the pooled covariance matrix of the groups and the individual group means. The classification rule is then based on these posterior probabilities: the classifier chooses the group which minimizes the expected loss of misclassification (equal losses are assumed). The method depends heavily on the assumptions mentioned and is very sensitive to deviations from them.

Method 2: This method is based on Fisher's discriminants. The discriminants give a low dimensional representation of the data, just as in PCA. The difference is that the low dimensional space is chosen so that the groups are separated from each other as much as possible (based on certain criteria). The classification rule is then based on the distance between a new observation and the individual group means in this discriminant space: the classifier chooses the group whose mean is nearest to the new observation. The maximum number of discriminants obtained is s = min(g-1, p), where g is the number of groups and p is the number of variables.

In the two methods the prior probabilities are all assumed to be equal. This is reasonable because every subject is equally likely to be presented for identification (recognition). It turns out that the classification procedure based on Fisher's discriminants is the same as linear discriminant analysis assuming multivariate normality of features given groups, if the prior probabilities are all the same and if all the discriminants are used for classification (Johnson & Wichern pg. 694 [2]). Hence, method 2 and method 1 give equivalent classification rules under such conditions.

Even though method 1 can theoretically be performed on the raw pixel data, which is 400x10304 in dimensions, it is not computationally feasible to estimate the covariance matrix, which would be of 10304x10304 dimensions. This would require huge amounts of memory. A way out is to use PCA to reduce the dimensionality before applying LDA. Also, as the principal components are linear combinations of 10304 variables, they would be closer to normality. This follows from the central limit theorem, which states that a sum of many independent variables with finite variance, whatever their distribution, is approximately normally distributed. Even though the variables are not independent here, the principal components are still close to normality, as will be shown later. Hence, using PCA we can reduce dimensionality and improve the approximation to normality in one shot.

The two methods are discussed in detail below.

Method 1:

We want to classify the individuals based on their face images by Linear Discriminant Analysis (LDA). Each 10304-pixel intensity vector can be thought of as a vector of features, and each individual can be thought of as a class into which the features are classified. LDA cannot be performed on the raw pixel intensity values (which can be thought of as features) because the covariance matrix is of large dimensions (10304x10304). Computing the covariance matrix and computing its inverse are both computationally infeasible.

To overcome this problem, Principal Component Analysis (PCA) is first performed on the pixel intensity data. But to get the principal components of X, which is 400x10304, we have to obtain the covariance matrix, which is of 10304x10304 dimensions, and then find its eigenvectors. So, the problem still remains. Nevertheless, note that the rank of X is at most 400, which means there are at most 400 eigenvectors with non-zero eigenvalues.

A simple way of finding these eigenvectors is proposed by Turk and Pentland [1]. If $v$ is an eigenvector of $XX^T$ then $X^Tv$ is an eigenvector of $X^TX$. Mathematically, if $v$ is an eigenvector of $XX^T$ with eigenvalue $\lambda$, then

$$XX^T v = \lambda v$$

Multiplying both sides by $X^T$ we get

$$X^TX\,(X^Tv) = \lambda\,(X^Tv)$$

Hence $X^Tv$ is an eigenvector of $X^TX$.
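The relation between the eigenvectors of the two matrices can be checked numerically. The sketch below assumes NumPy and uses a small random matrix (20 "images" by 500 "pixels") in place of the 400x10304 data; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 500                      # small stand-ins for 400 images, 10304 pixels
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)              # mean-center each pixel column

# Eigen-decomposition of the small n x n matrix X X^T
evals, V = np.linalg.eigh(X @ X.T)
order = np.argsort(evals)[::-1]     # largest eigenvalue first
evals, V = evals[order], V[:, order]

# Keep only directions with a non-zero eigenvalue (centering removes one)
keep = evals > 1e-8 * evals[0]
evals, V = evals[keep], V[:, keep]

# Map each v to X^T v, then rescale every column to unit length
U = X.T @ V
U /= np.linalg.norm(U, axis=0)

# Check (X^T X) u = lambda u without ever forming the p x p matrix
lhs = X.T @ (X @ U)
rhs = U * evals
print(np.allclose(lhs, rhs))
```

Only the n x n matrix is ever decomposed; the p x p matrix appears implicitly through two matrix-vector products.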

Note that X, as used here, is mean centered. The mean of X (before mean centering), i.e. the mean of all faces, is called the average face. The average face is shown below.

Figure 8: The average face

Also, note that the eigenvalues do not change, i.e. the non-zero eigenvalues of $XX^T$ are the same as those of $X^TX$. The eigenvectors calculated this way are normalized to unit length. The order of the matrix from which eigenvectors have to be estimated is reduced from 10304x10304 to 400x400 (i.e. from the square of the number of pixels to the square of the number of images). The eigenvectors and eigenvalues are estimated using MATLAB. The larger an eigenvalue, the more important the corresponding eigenvector, in the sense that the variance of the data set in that direction is greater than the variance in the directions of eigenvectors with smaller eigenvalues.
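To give a feel for the saving, the storage needed for each matrix can be worked out directly. The figures below assume dense storage in 8-byte double precision, which is an assumption made here and not something stated in the text.

```python
BYTES_PER_DOUBLE = 8  # assumption: dense double-precision storage

def dense_matrix_bytes(order):
    """Bytes needed to hold a dense square matrix of the given order."""
    return order * order * BYTES_PER_DOUBLE

pixel_cov = dense_matrix_bytes(112 * 92)   # 10304 x 10304 matrix
image_gram = dense_matrix_bytes(400)       # 400 x 400 matrix

print(pixel_cov, image_gram)               # 849379328 1280000
print(pixel_cov // image_gram)             # 663
```

Under this assumption the larger matrix needs roughly 850 MB against about 1.3 MB for the smaller one.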

These eigenvectors are called eigenfaces (Turk and Pentland [1]). The components of each eigenvector can be regarded as the weight of the corresponding pixel in forming the total image. The components of the calculated eigenvectors generally do not lie in the range 0-255, which is required to visualize an image. Hence, the eigenvectors are transformed so that the components lie in the 0-255 range, for visualization. These eigenfaces can be regarded as the faces that make up all the faces in the database, i.e. a linear combination of these faces can be used to represent any face in the data set. Also, at least one face can be represented by a linear combination of all the eigenfaces. The first few eigenfaces (after transformation) are shown below:

Figure 9: Eigenfaces
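The text does not say which transformation was used to bring the eigenvector components into the 0-255 range. A minimal sketch, assuming a simple linear min-max rescaling and NumPy, could look like this (the function name `to_gray_levels` is hypothetical):

```python
import numpy as np

def to_gray_levels(eigenface, shape=(112, 92)):
    """Linearly rescale a vector to 0..255 and reshape it into an image."""
    vec = np.asarray(eigenface, dtype=np.float64)
    lo, hi = vec.min(), vec.max()
    if hi == lo:                                   # constant vector: nothing to stretch
        return np.zeros(shape, dtype=np.uint8)
    scaled = (vec - lo) / (hi - lo) * 255.0        # min maps to 0, max maps to 255
    return np.round(scaled).astype(np.uint8).reshape(shape)

# Synthetic stand-in for one eigenface of length 112 * 92 = 10304
rng = np.random.default_rng(0)
image = to_gray_levels(rng.normal(size=10304))
print(image.shape, image.min(), image.max())
```

The rescaling only affects how the eigenface is displayed; the classification steps use the unscaled eigenvectors.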

The eigenfaces represent vectors in the 10304-dimensional space spanned by the variables (pixels). If we take these new vectors as the axes on which to represent the data, these new axes are the principal components we are seeking. They are just a rotation of the original axes to more meaningful directions. To illustrate how principal components are closer to normality, the normal probability plots of some of the first few principal components are given. To check the multivariate normality of the principal components, we check for univariate normality of the individual principal components (PC's). The normal (unconditional) probability plots of some of the important PC's are shown below, and for comparison the normal probability plots of the raw variables are shown as well.

Figure 1: Normal probability plot of the first pixel

Figure 2: Normal probability plot of the second pixel

Figure 3: Normal probability plot of the third pixel

Figure 4: Normal probability plot of PC 5

Figure 5: Normal probability plot of PC 2

Figure 6: Normal probability plot of PC 1

Figure 7: Normal probability plot of PC 9

The plots (Figures 1, 2, 3) show clearly that the unconditional distribution of the raw pixels is not normal. What we actually want to test is whether the conditional distribution of the pixels given a class is multivariate normal, but we test the unconditional distribution instead. This is because each class has only 10 data points (10 face images), and not much could be read from a conditional plot with such a small amount of data. Also, it is unreasonable to expect huge differences between the conditional densities. If the unconditional distributions look normal, the conditional distributions are unlikely to be far removed from normality. Hence we can test the unconditional distribution instead of the conditional one.

The plots (Figures 4, 5, 6, 7) clearly show that the principal components are closer to normality than the raw pixels. Also, all the principal components are uncorrelated with each other (by definition). Hence all the new features (principal components) will be independent, because uncorrelated jointly normal random variables are independent. Hence the features are multivariate normal, i.e. f(x) has a multivariate normal density, where x is the vector of principal components. It has already been said that the distribution of the various pixels does not vary much across the classes, hence we can assume f(x|y) to be close to multivariate normal too.
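The comparison made with the normal probability plots can be imitated on synthetic data. The sketch below assumes NumPy and the standard library's `statistics.NormalDist`; the "pixels" are independent uniform values, which is a simplifying assumption (real pixels are correlated), and the correlation coefficient of the plot is used here as a rough measure of straightness.

```python
import numpy as np
from statistics import NormalDist

def normal_plot_points(sample):
    """Return (theoretical normal quantiles, ordered sample) for a normal probability plot."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions in (0, 1)
    quantiles = np.array([NormalDist().inv_cdf(p) for p in probs])
    return quantiles, ordered

def straightness(sample):
    """Correlation between the plotted points; closer to 1 means closer to normal."""
    q, x = normal_plot_points(sample)
    return float(np.corrcoef(q, x)[0, 1])

rng = np.random.default_rng(0)
pixels = rng.uniform(0, 255, size=(400, 200))        # synthetic, non-normal "pixels"
weights = rng.normal(size=200)
combo = pixels @ weights                             # one linear combination per image

r_pixel = straightness(pixels[:, 0])
r_combo = straightness(combo)
print(r_pixel < r_combo)
```

A single uniform variable gives a visibly curved plot, while the linear combination of 200 of them gives a nearly straight one, in line with the central limit argument.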

The first few eigenfaces clearly separate the face from the hair and the background, and as such explain most of the variance in the data set. The distinction between the face, hair and background becomes blurred as we go along. The following plot shows the % variance explained versus the number of principal components. The first few principal components, which together explain up to 75% of the variance of the data set, are selected (the basis for this is explained later).

Figure 10: A scree plot for the first few principal components

The % variance explained by any principal component is calculated as the ratio of the corresponding eigenvalue to the sum of all the eigenvalues. The number of principal components to be considered is based on how good a classification one obtains using them. Considering just five components based on the scree plot does not give good classification.
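The variance calculation can be sketched as follows, assuming NumPy. The eigenvalues in the example are made up for illustration and the helper name `n_components_for` is hypothetical.

```python
import numpy as np

def n_components_for(eigenvalues, target=0.75):
    """Return the variance ratios and the smallest k whose cumulative ratio reaches target."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # largest first
    ratio = ev / ev.sum()                  # share of variance per component
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, target)) + 1
    return ratio, k

# Made-up eigenvalues: shares are 60%, 20%, 10%, 6%, 4%
ratio, k = n_components_for([60.0, 20.0, 10.0, 6.0, 4.0], target=0.75)
print(k)  # 2
```

Here two components are needed, because the first alone explains 60% and the first two together explain 80%.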

Projections (Reconstruction of an Image)

The projection of a vector $v$ onto a vector $u$ is given by:

$$\mathrm{Proj}_u\, v = \frac{\langle v, u \rangle}{\langle u, u \rangle}\, u$$

If $v$ is any image (after subtracting the average face) and $u$ is any eigenface, then the above formula gives the projection of the face $v$ onto the eigenface $u$. Any image can be projected onto each of the first few eigenfaces. All the projected vectors are added to get a final (reconstructed) image. After this, the average face has to be added back to obtain the actual projection of the face onto the lower dimensional space spanned by the eigenfaces.

As an example, a face and its projection into 5, 20, 50, 150, 400 dimensions (i.e. using 5,10,15,20,400 eigenfaces respectively) is shown below:

Figure 11: Projected Faces

Any image in the database can be completely reconstructed by using all the eigenfaces. The last image above is a complete reconstruction of the original image.
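The projection and reconstruction steps can be sketched as below, assuming NumPy. Random data stand in for the face images, and the eigenfaces are obtained here from a singular value decomposition purely for convenience; the function name `reconstruct` is hypothetical.

```python
import numpy as np

def reconstruct(face, mean_face, eigenfaces, k):
    """Project a face onto the first k eigenfaces (columns) and map back to pixel space."""
    v = face - mean_face                          # subtract the average face
    U = eigenfaces[:, :k]
    coeffs = (U.T @ v) / np.sum(U * U, axis=0)    # <v,u>/<u,u> for each eigenface u
    return mean_face + U @ coeffs                 # sum of projections plus average face

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))                     # 10 synthetic "faces", 50 "pixels"
mean_face = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean_face, full_matrices=False)
eigenfaces = Vt.T                                 # orthonormal columns

errors = [np.linalg.norm(X[0] - reconstruct(X[0], mean_face, eigenfaces, k))
          for k in (2, 5, 10)]
print(errors)
```

The reconstruction error shrinks as more eigenfaces are used and vanishes (up to rounding) when all of them are used, which mirrors the complete reconstruction described above. Summing the individual projections is valid because the eigenfaces are mutually orthogonal.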

Performing LDA:

With the above basic framework and with the additional assumption of equality of the covariance matrices of the groups, we can use LDA for classification. The assumption of equal covariance matrices is quite strong. No statistical test has been performed to check this assumption, as most such tests are not robust enough. Instead, two-fold cross validation has been performed to check the error rate of the final classification, and the assumptions are validated based on these results.
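A two-fold cross validation of this kind can be sketched as follows, assuming NumPy. The text does not say how the two halves were formed, so splitting every class in half at random is an assumption; the nearest-mean classifier is only a stand-in to make the example runnable, and all function names are hypothetical.

```python
import numpy as np

def two_fold_error(Z, labels, fit, predict, seed=0):
    """Split every class in half, train on one half, test on the other, then swap."""
    rng = np.random.default_rng(seed)
    fold = np.zeros(len(Z), dtype=int)
    for c in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == c))
        fold[members[len(members) // 2:]] = 1     # second half of each class
    wrong = 0
    for test_fold in (0, 1):
        train, test = fold != test_fold, fold == test_fold
        model = fit(Z[train], labels[train])
        wrong += int(np.sum(predict(model, Z[test]) != labels[test]))
    return wrong / len(Z)

# Stand-in classifier (nearest group mean), not the LDA rule of the text
def fit_means(Z, labels):
    classes = np.unique(labels)
    return classes, np.array([Z[labels == c].mean(axis=0) for c in classes])

def predict_means(model, Z):
    classes, means = model
    d2 = ((Z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]

rng = np.random.default_rng(1)
labels = np.repeat(np.arange(4), 10)              # 4 synthetic classes, 10 samples each
Z = rng.normal(size=(40, 3)) + 10.0 * labels[:, None]
cv_error = two_fold_error(Z, labels, fit_means, predict_means)
print(cv_error)
```

Any classifier exposing the same `fit` and `predict` pair can be passed in, so the same routine can be reused for both classification methods.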

The standard procedure for LDA was followed. The covariance matrix $\Sigma$ has been estimated using a pooled estimator. The prior probabilities have been considered to be equal. With this pooled estimator and the means $\bar{x}_y$ of the individual groups, the $\alpha_y$'s and $\beta_y$'s for each group have been calculated by

$$\alpha_y = -\frac{1}{2}\,\bar{x}_y'\,\Sigma^{-1}\,\bar{x}_y, \qquad \beta_y = \Sigma^{-1}\,\bar{x}_y$$

The classification rule (Bayes's minimum cost rule) using equal losses then becomes,

$$\hat{y}(x) = \arg\min_{y} \sum_{j=1}^{J} I(j \neq y)\,P(j \mid x) = \arg\min_{y} \sum_{j \neq y} P(j \mid x) = \arg\max_{y} P(y \mid x) = \arg\max_{y}\,\bigl(\alpha_y + \beta_y' x\bigr)$$

where x is a feature vector and J is the total number of classes (= 40). The classification obtained depends on the number of principal components considered. The number of principal components to be used for classification is chosen so that the APER (apparent error rate) is almost 0%. Using 23 principal components, the APER was found to be 0.02%. The APER is 0% when 60 or more principal components are considered. A closer estimate of the AER (actual error rate) can be calculated using two-fold cross validation. These estimates were calculated from the confusion matrices. The confusion matrices are not shown here as they are of 40x40 dimensions; instead, the final cross validated error rates are shown. The following table shows the number of principal components used and the AER:

| No. of PC's | 6? | 10? |
| --- | --- | --- |
| Error rate % | 10.2? | 11.5 |

Table 1: Cross validated error rates

It can be seen that the error rate does not decrease much as the number of principal components increases. In fact, the error rate when 100 PC's are used is higher than when only 23 PC's are used. This can be attributed to the larger rounding errors when inverting covariance matrices of large dimension (when using 100 PC's the pooled covariance matrix becomes 100x100). As the error rate does not decrease much with an increasing number of PC's, 23 PC's can be considered optimal. These 23 PC's explain up to 75% of the variance of the data set. Also, as the APER as well as the AER is quite small, we can conclude that our assumptions of multivariate normality of features given a group and equality of covariance matrices are not unreasonable.
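The LDA rule with a pooled covariance estimate and equal priors can be sketched as below, assuming NumPy. The data are synthetic (3 well separated classes in 4 dimensions standing in for principal component scores), and the names `fit_lda`, `predict_lda`, `alpha` and `beta` are illustrative choices rather than the notation of any particular library.

```python
import numpy as np

def fit_lda(Z, labels):
    """Estimate group means, pooled covariance, and linear score coefficients."""
    classes = np.unique(labels)
    idx = np.searchsorted(classes, labels)          # class index of every sample
    means = np.array([Z[idx == j].mean(axis=0) for j in range(len(classes))])
    centered = Z - means[idx]
    pooled = centered.T @ centered / (len(Z) - len(classes))
    beta = np.linalg.solve(pooled, means.T).T       # row y holds pooled^{-1} mean_y
    alpha = -0.5 * np.sum(means * beta, axis=1)     # -1/2 mean_y' pooled^{-1} mean_y
    return classes, alpha, beta

def predict_lda(model, Z):
    """Choose the class with the largest linear score (equal priors and losses)."""
    classes, alpha, beta = model
    scores = alpha + Z @ beta.T
    return classes[np.argmax(scores, axis=1)]

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)                # 3 synthetic classes, 10 samples each
centers = 10.0 * np.eye(3, 4)
Z = centers[labels] + rng.normal(size=(30, 4))

model = fit_lda(Z, labels)
aper = float(np.mean(predict_lda(model, Z) != labels))   # apparent error rate
print(aper)
```

Solving a linear system with the pooled matrix avoids forming its inverse explicitly, which is usually the numerically safer choice when the matrix is large.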

Method 2:

The Fisher discriminants are the eigenvectors of $\Sigma^{-1}B$, where $\Sigma$ is the common covariance matrix and $B$ is the between groups matrix. As in method 1, a pooled estimator of the covariance matrix $\Sigma$ is used. Again, this method is performed after reducing the dimensionality using PCA. The method is based on a simple metric criterion. Mathematically, the classification rule chooses class j if

$$\sum_{l=1}^{r} \bigl[a_l'\,(x - \bar{x}_j)\bigr]^2$$

is minimum, where r is the total number of discriminants used, $a_l$ is the l-th discriminant, x is the new observation and $\bar{x}_j$ is the mean of group j; i.e. the class whose mean is closest to the new observation (in a squared distance sense) is chosen.
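Fisher's rule can be sketched as follows, assuming NumPy. The between groups matrix is weighted by group size here, which is one common convention and an assumption of this sketch; the eigenproblem is solved through a Cholesky factor of the pooled matrix to keep everything real and symmetric. Data and function names are illustrative.

```python
import numpy as np

def fisher_discriminants(Z, labels, r):
    """Return classes, group means, and the first r discriminant vectors (columns)."""
    classes = np.unique(labels)
    idx = np.searchsorted(classes, labels)
    means = np.array([Z[idx == j].mean(axis=0) for j in range(len(classes))])
    centered = Z - means[idx]
    W = centered.T @ centered / (len(Z) - len(classes))    # pooled covariance
    D = means - Z.mean(axis=0)
    B = (D * np.bincount(idx)[:, None]).T @ D              # between groups matrix
    L_inv = np.linalg.inv(np.linalg.cholesky(W))           # W = L L'
    evals, E = np.linalg.eigh(L_inv @ B @ L_inv.T)
    top = np.argsort(evals)[::-1][:r]                      # largest eigenvalues first
    A = L_inv.T @ E[:, top]                                # eigenvectors of W^{-1} B
    return classes, means, A

def predict_fisher(model, Z):
    """Assign each row to the group whose mean is nearest in discriminant space."""
    classes, means, A = model
    scores, centers = Z @ A, means @ A
    d2 = ((scores[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)                       # g = 3 groups, p = 4 variables
Z = 10.0 * np.eye(3, 4)[labels] + rng.normal(size=(30, 4))

r = min(len(np.unique(labels)) - 1, Z.shape[1])            # s = min(g-1, p) = 2
model = fisher_discriminants(Z, labels, r)
error = float(np.mean(predict_fisher(model, Z) != labels))
print(error)
```

With this scaling each discriminant vector a satisfies a'Wa = 1, so squared distances in the discriminant space are measured in units of the pooled within-group spread.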

It was mentioned earlier that method 2 is the same as method 1 if all Fisher discriminants are used and if all the groups are assumed to have equal prior probabilities. To check if this is true, all 23 discriminants are chosen and the observations are classified. The error rate obtained was 12.5%, as shown in the last entry of Table 2. This is approximately the same as the error rate obtained from LDA. Further, to check the efficiency of this method, the number of Fisher discriminants as well as the number of PC's considered was varied. The tables below show the cross validated error rates in each case:

| No. of discriminants | 10 | 15 | 23 |
| --- | --- | --- | --- |
| Error rate % | 13.25 | 13.5 | 12.5 |

Table 2: Cross validated error rates using 23 PC's

| No. of discriminants | 15 | 30 | 60 |
| --- | --- | --- | --- |
| Error rate % | 11 | 12 | 11 |

Table 3: Cross validated error rates using 60 PC's

| No. of discriminants | 20 | 50 | 100 |
| --- | --- | --- | --- |
| Error rate % | 12.5 | 10 | 10 |

Table 4: Cross validated error rates using 100 PC's

The cross validated error rates show that, even though classification tends to improve as the number of discriminants and/or the number of PC's increases, the improvement is negligible. Hence, using 10 discriminants and 23 PC's is enough for good classification.

It was also found that new observations from groups 17, 39, 14 and 27 were misclassified more often than not. Observations from class 14 were mostly classified into classes 28 and 22. Likewise, observations from class 17 were classified mostly into classes 29 and 23; similarly, observations from class 39 were mostly classified into class 30.

To see the proximity between these groups, the data are plotted in the two dimensional Fisher discriminant space. The plot is shown below:

The plot shows all the groups and their centroids. It can be seen that the centroid of group 14 is closest to the centroids of groups 24 and 28. Similarly, groups 30 and 39 are near each other. Groups 27 and 23 are also close to group 17. The proximity between the groups is clear. Hence the rule performs poorly in these cases.

Based on the error rates of the two methods, it is clear that both method 1 and method 2 give almost the same classification.

Conclusions:

1) LDA and Fisher's method for classification have been successfully applied in face recognition.

2) The cross validated error rates are small when these methods are used.

3) Both the methods give approximately the same classification.

4) The assumptions of multivariate normality of features given groups and of equal covariance matrices appear reasonable, based on the evidence given by the cross validated error rates.

References

1. M. Turk, A. Pentland. "Eigenfaces for Recognition". Journal of Cognitive Neuroscience, Vol. 3, No. 1, 71-86, 1991.

2. Johnson and Wichern. Applied Multivariate Statistical Analysis. 1992.

Method 3:

Multiple logistic regression: There are no assumptions in this rule. This method is compared with the other two methods, as it is expected to give better results than the other two. The parameters in the posterior probability model are estimated by maximizing the conditional likelihood, treating the feature vectors as fixed. Mathematically,

The classification rule then becomes

Using all the 40 classes for classification proved to be computationally very expensive. Hence for this method only the first ten classes were considered. For the purpose of estimating the parameters, RPLC software was used. Multiple logistic regression is the same as reference point logistic classification with one reference point per class. Using RPLC with nk=10 gives one reference point per class. The data have been divided into training and test sets. The principal component scores of the training set as well as the test set have been obtained based on the principal components of the training set. The test and training sets have been interchanged to do two-fold cross validation. The following tables give the results produced by the RPLC software for different choices of the number of principal components:

| loop | icv | nk | lambda | trn1 | trn2 | trn3 | tst1 | tst2 | tst3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 10 | 0.0000 | 0.0003 | 0.0003 | 0.0000 | 16.6038 | 0.0067 | 0.9000 |
| 1 | 0 | 10 | 0.0000 | 0.0003 | 0.0003 | 0.0000 | 11.7149 | 0.1427 | 0.6600 |

Table 1: Using 23 PC's

| loop | icv | nk | lambda | trn1 | trn2 | trn3 | tst1 | tst2 | tst3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 10 | 0.0002 | 0.0002 | 0.0000 | 0.0000 | 12.6038 | 0.0288 | 0.8500 |
| 1 | 0 | 10 | 0.0000 | 0.0004 | 0.0004 | 0.0000 | 12.9544 | 0.1189 | 0.6200 |

Table 2: Using 15 PC's

| loop | icv | nk | lambda | trn1 | trn2 | trn3 | tst1 | tst2 | tst3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 10 | 0.0000 | 0.0013 | 0.0013 | 0.0000 | 38.3964 | 0.0025 | 0.9000 |
| 1 | 0 | 10 | 0.0584 | 0.0002 | 0.0492 | 0.0000 | 30.7654 | 0.0867 | 0.7000 |

Table 3: Using 5 PC's

From the tables it can be seen that the error rates on the training sets (column trn3) are zero even with 5 PC's, but the error rates on the test sets are very high. Even with 23 PC's, an error rate of 90% is shown by the software.

Hence method 1 and method 2 perform well when compared to multiple logistic regression.

Method 3: This method is the most robust of the three. The posterior probabilities are modeled directly, without any distributional assumptions concerning the marginal density of the features. The form of the posterior probabilities is assumed to be the same as that obtained in method 1, but the conditional distribution is not restricted to being multivariate normal; it only has to belong to a bigger family of exponential functions. The classification rule used here is the same as mentioned in method 1: the classifier chooses the group which minimizes the expected loss of misclassification (equal losses are assumed). As the method doesn't assume equal covariance matrices and multivariate normality, it is more robust to deviations from these assumptions.
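For reference, the posterior form usually taken by multiple logistic regression, together with the conditional likelihood that is maximized, can be written as below. The symbols $\alpha_j$ and $\beta_j$ (intercept and coefficient vector of class j) and the sample notation $(x_i, y_i)$, $i = 1, \dots, n$, are assumptions introduced here for illustration and are not taken from the source.

```latex
P(y = j \mid x) = \frac{\exp(\alpha_j + \beta_j' x)}{\sum_{k=1}^{J} \exp(\alpha_k + \beta_k' x)},
\qquad
(\hat{\alpha}, \hat{\beta}) = \arg\max_{\alpha, \beta} \sum_{i=1}^{n} \log P(y = y_i \mid x_i)
```

With equal losses, choosing the class with the largest posterior is the same as choosing the class with the largest linear score $\alpha_j + \beta_j' x$, since the denominator is common to all classes.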

3) Multiple logistic regression (MLR), which assumes neither of the above.