
Robust Real-Time Face Detection

Outline


AdaBoost


Learning Algorithm


Face Detection in real life


Using AdaBoost for Face Detection


Improvements


Demonstration

AdaBoost

A Short Introduction to Boosting (Freund & Schapire, 1999)

Logistic Regression, AdaBoost and Bregman Distances (Collins, Schapire, Singer, 2002)

Boosting

The Horse-Racing Gambler Problem

Rules of thumb for a set of races

How should we choose the set of races in order to get the best rules of thumb?

How should the rules be combined into a single highly accurate prediction rule?

Boosting!

AdaBoost - the idea

Initialize sample weights

For each cycle:

Find a classifier that performs well on the weighted sample

Increase weights of misclassified examples

Return a weighted list of classifiers

AdaBoost agglomerates many weak classifiers into one strong classifier.

[Figure: toy example of weak classifiers thresholding two features, shoe size and IQ]

AdaBoost - algorithm

Given $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i \in X$ and $y_i \in Y = \{-1, +1\}$

Initialize $D_1(i) = 1/m$

For $t = 1 \ldots T$:

Select the best weak classifier using distribution $D_t$

Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$

Choose $\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$

Update $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases}$, where $Z_t$ is a normalization factor

Output the final hypothesis: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
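To make the update rule concrete, here is a minimal AdaBoost sketch in Python (our own illustration, not code from the slides or the cited papers); simple decision stumps play the role of the weak classifiers, and names such as `train_stump` and `strong_classify` are made up for this example.

```python
import numpy as np

def train_stump(X, y, D):
    """Exhaustively pick the decision stump (feature index, threshold, polarity)
    with the lowest weighted error under the sample distribution D."""
    n_samples, n_features = X.shape
    best, best_err = None, np.inf
    for j in range(n_features):
        for thr in np.unique(X[:, j]):
            for p in (+1, -1):
                pred = np.where(p * X[:, j] < p * thr, 1, -1)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thr, p)
    return best, best_err

def stump_predict(stump, X):
    j, thr, p = stump
    return np.where(p * X[:, j] < p * thr, 1, -1)

def adaboost(X, y, T=10):
    """Return [(alpha_t, stump_t)] so that H(x) = sign(sum_t alpha_t * h_t(x))."""
    y = np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        stump, eps = train_stump(X, y, D)        # weak hypothesis with weighted error eps_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = 1/2 ln((1 - eps_t) / eps_t)
        pred = stump_predict(stump, X)
        D = D * np.exp(-alpha * y * pred)        # up-weight the misclassified examples
        D = D / D.sum()                          # divide by the normalization factor Z_t
        ensemble.append((alpha, stump))
    return ensemble

def strong_classify(ensemble, X):
    votes = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(votes)
```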






AdaBoost - training error

Freund and Schapire (1997) proved that:

$err'(H) \le \exp\left(-2 \sum_{t=1}^{T} \gamma_t^2\right)$, where $\gamma_t = \frac{1}{2} - \epsilon_t$ and $err'$ is the training error

AdaBoost ADApts to the error rates of the individual weak hypotheses.
2
AdaBoost - generalization error

Freund and Schapire (1997) showed that:

$err(H) = \Pr[H(x) \neq y] \le \Pr{}'[H(x) \neq y] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$

where: $\Pr{}'$ - the empirical probability on the training sample; $d$ - VC dimension; $T$ - number of rounds; $m$ - training set size

AdaBoost - generalization error

The analysis implies that boosting will overfit if run for too many rounds

However, it was observed empirically that AdaBoost does not overfit, even when run for thousands of rounds

Moreover, it was observed that the generalization error continues to decrease long after the training error has reached zero

AdaBoost - generalization error

An alternative analysis that suits the empirical findings was presented by Schapire et al. (1998):

$\Pr[H(x) \neq y] \le \Pr{}'[\mathrm{margin}(x, y) \le \theta] + O\left(\sqrt{\frac{d}{m\theta^2}}\right)$

where: $\mathrm{margin}(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}$
AdaBoost - different point of view

We try to solve the problem of approximating the $y_i$'s using a linear combination of weak hypotheses

In other words, we are interested in the problem of finding a vector of parameters $\alpha$ such that

$f(x_i) = \sum_{j=1}^{n} \alpha_j h_j(x_i)$

is a 'good approximation' of $y_i$

For classification problems we try to match the sign of $f(x_i)$ to $y_i$

AdaBoost - different point of view

Sometimes it is advantageous to minimize some other (non-negative) loss function instead of the number of classification errors

For AdaBoost the loss function is

$\sum_{i=1}^{n} \exp(-y_i f(x_i))$

This point of view was used by Collins, Schapire and Singer (2002) to demonstrate that AdaBoost converges to optimality
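As a small numeric illustration (hypothetical values, our own snippet), the exponential loss still penalizes correctly classified examples that have a small margin, unlike the plain classification error:

```python
import numpy as np

def exponential_loss(f_values, y):
    """sum_i exp(-y_i * f(x_i)) for real-valued ensemble outputs f(x_i)."""
    return np.sum(np.exp(-y * f_values))

y = np.array([+1, -1, +1])
f = np.array([2.0, -0.3, 0.1])   # all three signs are correct: zero classification errors
print(exponential_loss(f, y))    # exp(-2) + exp(-0.3) + exp(-0.1) ~ 1.78
```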

Face Detection

(not face recognition)


Face Detection in Monkeys

There are cells that 'detect faces'

Face Detection in Humans

There are 'processes of face detection'

Faces Are Special

We analyze faces in a 'different way'

Face Recognition in Humans

We analyze faces 'in a specific location'

Robust Real-Time Face Detection

Viola and Jones, 2003

Features

Picture analysis, Integral Image

Features


The system classifies images based on the value
of simple features

Two-rectangle

Three-rectangle

Four-rectangle

Value = ∑(pixels in white area) − ∑(pixels in black area)

Contrast Features

[Figure: source image and the resulting contrast-feature response]

Features

Notice that each feature is related to a special location in the sub-window

Why features and not pixels?

Encode domain knowledge

A feature-based system operates faster

Inspiration from human V1

Features


Later we will see that there are other
features that can be used to implement an
efficient face detector


The original system of Viola and Jones used
only rectangle features

Computing Features

Given a detection resolution of 24x24, and size of ~200x200, the set of rectangle features is ~160,000!

We need to find a way to rapidly compute the features

Integral Image

Intermediate representation of the image:

$ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y')$

Computed in one pass over the original image using the recurrences:

$s(x, y) = s(x, y-1) + i(x, y)$
$ii(x, y) = ii(x-1, y) + s(x, y)$

where $s(x, y)$ is the cumulative row sum, $s(x, -1) = 0$ and $ii(-1, y) = 0$
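A small sketch of these recurrences in Python/numpy (the helper name `integral_image` is ours; the array `i` is indexed as `i[x, y]` to mirror the formulas above):

```python
import numpy as np

def integral_image(i):
    """ii(x, y) = sum of i(x', y') over x' <= x and y' <= y, built in one pass
    using s(x, y) = s(x, y-1) + i(x, y) and ii(x, y) = ii(x-1, y) + s(x, y)."""
    nx, ny = i.shape
    ii = np.zeros((nx, ny), dtype=np.float64)
    for x in range(nx):
        s = 0.0                                    # s(x, -1) = 0
        for y in range(ny):
            s = s + i[x, y]                        # s(x, y) = s(x, y-1) + i(x, y)
            prev = ii[x - 1, y] if x > 0 else 0.0  # ii(-1, y) = 0
            ii[x, y] = prev + s                    # ii(x, y) = ii(x-1, y) + s(x, y)
    return ii

# sanity check against the direct cumulative sums
img = np.arange(12.0).reshape(3, 4)
assert np.allclose(integral_image(img), img.cumsum(axis=0).cumsum(axis=1))
```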
Integral Image

Using the integral image representation one can compute the value of any rectangular sum in constant time.

For example, the sum inside rectangle D can be computed as:

ii(4) + ii(1) − ii(2) − ii(3)

[Figure: rectangles A, B, C, D with corner reference points 1-4, and the recurrences s(x, y) = s(x, y-1) + i(x, y), ii(x, y) = ii(x-1, y) + s(x, y) illustrated on the image grid starting at (0, 0)]
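Assuming the `integral_image` helper sketched earlier (with `ii[x, y]` indexing) and numbering the reference points as in the figure, the constant-time rectangle sum looks roughly like this:

```python
def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels inside the rectangle with opposite corners (x0, y0) and
    (x1, y1), inclusive, via four references: ii(4) + ii(1) - ii(2) - ii(3)."""
    p1 = ii[x0 - 1, y0 - 1] if (x0 > 0 and y0 > 0) else 0.0  # point 1, diagonally opposite point 4
    p2 = ii[x0 - 1, y1] if x0 > 0 else 0.0                   # point 2
    p3 = ii[x1, y0 - 1] if y0 > 0 else 0.0                   # point 3
    p4 = ii[x1, y1]                                          # point 4, the far corner of the rectangle
    return p4 + p1 - p2 - p3

# A two-rectangle feature is then just a difference of two such sums, e.g.
# value = rect_sum(ii, ...white half...) - rect_sum(ii, ...black half...)
```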

Integral Image

[Figure: computing a rectangle-difference feature directly from the integral image, with weights −1, +1, +2, −1, −2, +1 at the marked points (x, y)]

Building a Detector

Cascading, training a cascade

Main Ideas


The Features will be used as weak
classifiers


We will concatenate several detectors
serially into a cascade


We will boost (using a version of AdaBoost)
a number of features to get ‘good enough’
detectors




Weak Classifiers

Weak Classifier: a feature which best separates the examples

Given a sub-window (x), a feature (f), a threshold (θ), and a polarity (p) indicating the direction of the inequality:

$h(x, f, p, \theta) = \begin{cases} 1 & \text{if } p f(x) < p\theta \\ 0 & \text{otherwise} \end{cases}$
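In code, the weak classifier is just a thresholded feature value; a minimal sketch (our own function name, with `f_x` standing for the rectangle-feature value already computed on the sub-window):

```python
def weak_classify(f_x, theta, p):
    """h(x, f, p, theta) = 1 if p * f(x) < p * theta, else 0.
    p = +1 accepts values below the threshold, p = -1 accepts values above it."""
    return 1 if p * f_x < p * theta else 0

# e.g. a hypothetical eye-region feature that is strongly negative on faces:
# weak_classify(f_x=-120.0, theta=-40.0, p=+1)  ->  1
```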
Weak Classifiers


A weak classifier is a combination of a feature and a threshold

We have K features

We have N thresholds, where N is the number of examples

Thus there are KN weak classifiers

Weak Classifier Selection


For each feature sort the examples based on
feature value


For each element evaluate the total sum of positive/negative example weights (T+/T−) and the sum of positive/negative weights below the current example (S+/S−)

The error for a threshold which splits the range between the current and previous example in the sorted list is

$e = \min\left(S^+ + (T^- - S^-),\ S^- + (T^+ - S^+)\right)$
An example

x  | y  | f | W   | T+  | T−  | S+  | S−  | A   | B   | e
X1 | −1 | 2 | 1/5 | 3/5 | 2/5 | 0   | 0   | 2/5 | 3/5 | 2/5
X2 | −1 | 3 | 1/5 | 3/5 | 2/5 | 0   | 1/5 | 1/5 | 4/5 | 1/5
X3 | +1 | 5 | 1/5 | 3/5 | 2/5 | 0   | 2/5 | 0   | 5/5 | 0
X4 | +1 | 7 | 1/5 | 3/5 | 2/5 | 1/5 | 2/5 | 1/5 | 4/5 | 1/5
X5 | +1 | 8 | 1/5 | 3/5 | 2/5 | 2/5 | 2/5 | 2/5 | 3/5 | 2/5

where A = S+ + (T− − S−), B = S− + (T+ − S+), and e = min(A, B); the best threshold falls just below X3 (e = 0).
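A sketch of this single-pass selection in Python (names such as `best_threshold` are ours; feature values, labels and weights are numpy arrays, labels in {−1, +1}):

```python
import numpy as np

def best_threshold(feature_values, labels, weights):
    """Scan the examples in sorted order, maintaining S+/S- against the totals
    T+/T-, and keep the split with error e = min(S+ + (T- - S-), S- + (T+ - S+))."""
    order = np.argsort(feature_values)
    f, y, w = feature_values[order], labels[order], weights[order]
    T_pos, T_neg = w[y == +1].sum(), w[y == -1].sum()
    S_pos = S_neg = 0.0
    best_err, best_thr, best_pol = np.inf, None, None
    for i in range(len(f)):
        err_pos_below = S_neg + (T_pos - S_pos)   # label everything below the split as positive
        err_neg_below = S_pos + (T_neg - S_neg)   # label everything below the split as negative
        err = min(err_pos_below, err_neg_below)
        if err < best_err:
            best_err = err
            best_thr = f[i] if i == 0 else 0.5 * (f[i] + f[i - 1])
            best_pol = +1 if err_pos_below <= err_neg_below else -1
        if y[i] == +1:
            S_pos += w[i]
        else:
            S_neg += w[i]
    return best_err, best_thr, best_pol

# On the table above: f = [2, 3, 5, 7, 8], y = [-1, -1, +1, +1, +1], w = [1/5] * 5
# gives error 0 with a threshold between 3 and 5 and polarity -1 (faces have f > threshold).
```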

Main Ideas


The Features will be used as weak
classifiers


We will concatenate several detectors
serially into a cascade


We will boost (using a version of AdaBoost)
a number of features to get ‘good enough’
detectors




Cascading


We start with simple classifiers which reject many of the negative sub-windows while detecting almost all positive sub-windows

A positive result from the first classifier triggers the evaluation of a second (more complex) classifier, and so on

A negative outcome at any point leads to the immediate rejection of the sub-window
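A cascade is essentially a chain of early exits; a minimal evaluation sketch (the `stages` structure and the function names are our own assumptions, not the paper's interface):

```python
def cascade_classify(stages, window):
    """stages: list of (stage_score, stage_threshold) pairs, where stage_score(window)
    returns the boosted sum of weak-classifier votes for that layer."""
    for stage_score, stage_threshold in stages:
        if stage_score(window) < stage_threshold:
            return False   # a negative outcome anywhere rejects the sub-window immediately
    return True            # the window survived every layer: report a face
```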

Cascading

Main Ideas


The Features will be used as weak
classifiers


We will concatenate several detectors
serially into a cascade


We will boost (using a version of AdaBoost)
a number of features to get ‘good enough’
detectors




Training a cascade


User selects values for:


Maximum acceptable false positive rate per
layer


Minimum acceptable detection rate per layer


Target overall false positive rate


User gives a set of positive and negative
examples

Training a cascade (cont.)


While the overall false positive rate is not met:

While the false positive rate of the current layer is above the maximum per layer:

Train a classifier with n features using AdaBoost on the set of positive and negative examples

Decrease the threshold for the current classifier until the detection rate of the layer is more than the minimum

Evaluate the current cascade classifier on the validation set

Evaluate the current cascade detector on a set of non-face images and put any false detections into the negative training set
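A loop-level sketch of the nested procedure above (hedged: the three callables and the `threshold` attribute are assumptions made for illustration, not the authors' code):

```python
def train_cascade(train_layer, eval_rates, mine_negatives,
                  f_max=0.5, d_min=0.99, F_target=1e-5):
    """train_layer(n) -> boosted layer with n features;
    eval_rates(cascade) -> (detection_rate, false_positive_rate) on a validation set;
    mine_negatives(cascade) -> rebuild the negative set from false detections on non-face images."""
    cascade, F_overall = [], 1.0
    while F_overall > F_target:                 # overall false positive goal not met yet
        n, F_layer = 0, F_overall
        while F_layer > f_max * F_overall:      # layer still above its false positive budget
            n += 1
            layer = train_layer(n)              # AdaBoost with n features
            d, F_layer = eval_rates(cascade + [layer])
            while d < d_min:                    # relax the layer threshold until the
                layer.threshold -= 0.01         # per-layer detection rate is acceptable
                d, F_layer = eval_rates(cascade + [layer])
        cascade.append(layer)
        F_overall = F_layer
        mine_negatives(cascade)                 # false detections become the new negatives
    return cascade
```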

Results

Training Data Set

4916 hand-labeled faces

Aligned to base resolution (24x24)

Non-faces for the first layer were collected from 9500 non-face images

Non-faces for subsequent layers were obtained by scanning the partial cascade across non-face images and collecting false positives (max 6000 for each layer)


Structure of the Detector

38-layer cascade

6060 features

Layer number       | 1    | 2    | 3 to 4 | 5 to 38
Number of features | 2    | 10   | 50     | -
Detection rate     | 100% | 100% | -      | -
Rejection rate     | 50%  | 80%  | -      | -
Speed of the Final Detector

On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds


Improvements

Learning Object Detection from a Small Number of Examples: the Importance of Good Features (Levy & Weiss, 2004)

Improvements


Performance depends crucially on the features that are used to represent the objects (Levy & Weiss, 2004)


Good Features imply:


Good results from small training databases


Better generalization abilities


Shorter (faster) classifiers

Edge Orientation Histogram


Invariant to global illumination changes


Captures geometric properties of faces


Domain knowledge represented:

The inner part of the face includes more horizontal edges than vertical ones

The ratio between vertical and horizontal edges is bounded

The area of the eyes includes mainly horizontal edges

The chin has more or less the same number of oblique edges on both sides

Edge Orientation Histogram


The EOH can be calculated using the same kind of Integral Image idea:

We find the gradients at the point (x, y) using Sobel masks

We calculate the orientation of the edge at (x, y)

We divide the edges into K bins

The result is stored in K matrices

We use the same idea of the Integral Image for the matrices
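A rough sketch of that pre-computation (our own function name; `np.gradient` stands in for the Sobel masks, so treat this as an approximation of the paper's setup rather than its implementation):

```python
import numpy as np

def eoh_integral_images(img, K=8):
    """Gradients -> edge orientation -> K orientation bins -> one integral image
    per bin, so any rectangular orientation histogram costs O(K) lookups."""
    gy, gx = np.gradient(img.astype(np.float64))    # simple gradient stand-in for Sobel masks
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)                # edge orientation in (-pi, pi]
    bins = ((orientation + np.pi) / (2 * np.pi) * K).astype(int) % K
    integral = np.zeros((K,) + img.shape)
    for k in range(K):
        layer = np.where(bins == k, magnitude, 0.0)          # edge energy of bin k only
        integral[k] = layer.cumsum(axis=0).cumsum(axis=1)    # integral image of that bin
    return integral
```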

EOH Features


The ratio between two
orientations


The dominance of a given
orientation


Symmetry Features

Results


Already with only 250 positive examples we can see above a 90% detection rate

Faster classifier

Better performance on profile faces

Demo

Implementing the Viola & Jones system (Frank Fritze, 2004)