A Robust Real Time Face
Detection
Outline
AdaBoost
–
Learning Algorithm
Face Detection in real life
Using AdaBoost for Face Detection
Improvements
Demonstration
AdaBoost
A short Introduction to Boosting (Freund & Schapire,
1999
)
Logistic Regression, AdaBoost and Bregman Distances
(Collins, Schapire, Singer,
2002
)
Boosting
The Horse

Racing Gambler Problem
–
Rules of thumb for a set of races
–
How should we choose the set of races in order
to get the best rules of thumb?
–
How should the rules be combined into a single
highly accurate prediction rule?
Boosting !
AdaBoost

the idea
Initialize sample weights
For each cycle:
–
Find a classifier that performs
well on the weighted sample
–
Increase weights of
misclassified examples
Return a weighted list of
classifiers
AdaBoost agglomerates many weak
classifiers into one strong classifier.
Shoe
size
Shoe
size
IQ
AdaBoost

algorithm
T
t
t
t
t
y
x
h
e
y
x
h
e
t
t
t
t
t
t
i
i
t
D
i
t
t
t
i
i
m
m
x
h
sign
x
H
Z
Z
i
D
i
D
y
x
h
X
h
D
T
y
m
i
D
Y
y
X
x
y
x
y
x
i
i
i
t
i
i
t
t
1
)
(
if
)
(
if
1
~
1
1
1
))
(
(
)
(
:
hypothesis
final
Output the
factor
ion
normalizat
a
is
where
)
(
)
(
e
Updat
)
1
ln(
2
1
Choose
]
)
(
[
Pr
error
with
}
1
,
1
{
:
hypothesis
Get weak
on
distributi
using
classifier
best weak
Select the
..
1
For
/
1
)
(
Initialize
}
1
,
1
{
,
where
)
,
(
),..,
,
(
Given
AdaBoost
–
training error
Freund and Schapire (
1997
) proved that:
AdaBoost ADApts to the error rates of the
individual weak hypotheses.
t
t
γ
e
H
err
T
t
t
2
1
where
,
)
(
'
1
2
AdaBoost
–
generalization error
Freund and Schapire (
1997
) showed that:
size
set
training

rounds
of
number

dimension
VC

sample
training
on the
y
probabilit
empirical
the

Pr
:
where
,
)
(
]
)
(
[
Pr'
)
(
m
T
d
y]
'[H(x)
m
Td
O
y
x
H
H
err
AdaBoost
–
generalization error
The analysis implies that boosting will overfit
if run for too many rounds
However, it was observed empirically that
AdaBoost does not overfit, even when run
thousands of rounds.
Moreover, it was observed that the
generalization error continues to drive down
long after training error reached zero
AdaBoost
–
generalization error
An alternative analysis was presented by
Schapire et al. (
1998
), that suits the empirical
findings
x
h
y
y
x
t
t
t
m
d
O
y
x
H
err
)
(
)
,
(
margin
:
where
2

1
]
)
(
]
)
,
(
margin
[
Pr'
)
(
Pr[
AdaBoost
–
different point of view
We try to solve the problem of approximating the
y
’s using a linear combination of weak hypotheses
In other words, we are interested in the problem of
finding a vector of parameters
α
such that
is a ‘good approximation’ of
y
i
For classification problems we try to match the
sign of
f(x
i
) to y
i
n
j
i
j
j
i
x
h
x
f
1
)
(
)
(
AdaBoost
–
different point of view
Sometimes it is advantageous to minimize some
other (non

negative) loss function instead of the
number of classification errors
For AdaBoost the loss function is
This point of view was used by
Collins, Schapire
and Singer (
2002
) to demonstrate that AdaBoost
converges to optimality
n
i
i
i
x
f
y
1
))
(
exp(
Face Detection
(not face recognition)
Face Detection in Monkeys
There are cells that
‘detect faces’
Face Detection in Human
There are ‘processes of
face detection’
Faces Are Special
We analyze faces in a
‘different way’
Faces Are Special
We analyze faces in a
‘different way’
Faces Are Special
We analyze faces in a
‘different way’
Face Recognition in Human
We analyze faces ‘in a
specific location’
Robust Real

Time Face
Detection
Viola and Jones,
2003
Features
Picture analysis, Integral Image
Features
The system classifies images based on the value
of simple features
Two

rectangle
Three

rectangle
Four

rectangle
Value =
∑ (pixels in white area)

∑
(pixels in black area)
Contrast Features
Source
Result
Features
Notice that each feature is related to a
special location in the sub

window
Why features and not pixels?
–
Encode domain knowledge
–
Feature based system operates faster
–
Inspiration from human V
1
Features
Later we will see that there are other
features that can be used to implement an
efficient face detector
The original system of Viola and Jones used
only rectangle features
Computing Features
Given a detection resolution of
24
x
24
, and
size of ~
200
x
200
, the set of rectangle
features is ~
160
,
000
!
We need to find a way to rapidly compute
the features
Integral Image
Intermediate
representation of the
image
Computed in one pass
over the original image
y
y
x
x
y
x
i
y
x
ii
'
,
'
)
'
,
'
(
)
,
(
0
)
,
1
(
0
)
1
,
(
)
,
(
)
,
1
(
)
,
(
)
,
(
)
1
,
(
)
,
(
y
ii
x
s
y
x
s
y
x
ii
y
x
ii
y
x
i
y
x
s
y
x
s
Integral Image
Using the integral image representation
one can compute the value of any
rectangular sum in constant time.
For example the integral sum inside
rectangle D we can compute as:
ii
(
4
) +
ii
(
1
)
–
ii
(
2
)
–
ii
(
3
)
(x,y)
s
(
x
,
y
) =
s
(
x
,
y

1
) +
i
(
x
,
y
)
ii
(
x
,
y
) =
ii
(
x

1
,
y
) + s(
x
,
y
)
(
0
,
0
)
x
y
Integral Image

1
+
1
+
2

1

2
+
1
Integral
Image
(x,y)
(x,y)
Building a Detector
Cascading, training a cascade
Main Ideas
The Features will be used as weak
classifiers
We will concatenate several detectors
serially into a cascade
We will boost (using a version of AdaBoost)
a number of features to get ‘good enough’
detectors
Main Ideas
The Features will be used as weak
classifiers
We will concatenate several detectors
serially into a cascade
We will boost (using a version of AdaBoost)
a number of features to get ‘good enough’
detectors
Weak Classifiers
Weak Classifier : A feature which best separates
the examples
Given a sub

window (
x
), a feature (
f
), a threshold
(
Θ
), and a polarity (
p
) indicating the direction of
the inequality:
p
x
pf
p
f
x
h
)
(
1
)
,
,
,
(
Weak Classifiers
A weak classifier is a combination of a
feature and a threshold
We have
K
features
We have
N
thresholds where
N
is the
number of examples
Thus there are
KN
weak classifiers
Weak Classifier Selection
For each feature sort the examples based on
feature value
For each element evaluate the total sum of
positive/negative example weights (T+/T

) and
the sum of positive/negative weights below the
current example (S+/S

)
The error for a threshold which splits the range
between the current and previous example in the
sorted list is
))
(
),
(
min(
S
T
S
S
T
S
e
An example
e
B
A
S

S+
T

T+
W
f
y
x
2/5
3/5
2/5
0
0
2/5
3/5
1/5
2

1
X1
1/5
4/5
1/5
1/5
0
2/5
3/5
1/5
3

1
X2
0
5/5
0
2/5
0
2/5
3/5
1/5
5
1
X3
1/5
4/5
1/5
2/5
1/5
2/5
3/5
1/5
7
1
X4
2/5
3/5
2/5
2/5
2/5
2/5
3/5
1/5
8
1
X5
Main Ideas
The Features will be used as weak
classifiers
We will concatenate several detectors
serially into a cascade
We will boost (using a version of AdaBoost)
a number of features to get ‘good enough’
detectors
Main Ideas
The Features will be used as weak
classifiers
We will concatenate several detectors
serially into a cascade
We will boost (using a version of AdaBoost)
a number of features to get ‘good enough’
detectors
Cascading
We start with simple classifiers which reject
many of the negative sub

windows while
detecting almost all positive sub

windows
Positive results from the first classifier
triggers the evaluation of a second (more
complex) classifier, and so on
A negative outcome at any point leads to the
immediate rejection of the sub

window
Cascading
Main Ideas
The Features will be used as weak
classifiers
We will concatenate several detectors
serially into a cascade
We will boost (using a version of AdaBoost)
a number of features to get ‘good enough’
detectors
Main Ideas
The Features will be used as weak
classifiers
We will concatenate several detectors
serially into a cascade
We will boost (using a version of
AdaBoost) a number of features to get
‘good enough’ detectors
Training a cascade
User selects values for:
–
Maximum acceptable false positive rate per
layer
–
Minimum acceptable detection rate per layer
–
Target overall false positive rate
User gives a set of positive and negative
examples
Training a cascade (cont.)
While the overall false positive rate is not met:
–
While the false positive rate of current layer is less than
the maximum per layer:
Train a classifier with
n
features using AdaBoost on set of
positive and negative examples
Decrease threshold for current classifier detection rate of the
layer is more than the minimum
Evaluate current cascade classifier on validation set
–
Evaluate current cascade detector on a set of non faces
images and put any false detections into the negative
training set
Results
Training Data Set
4916
hand labeled faces
Aligned to base resolution
(
24
x
24
)
Non faces for first layer
were collected from
9500
non faces images
Non faces for subsequent
layers were obtained by
scanning the partial
cascade across non faces
and collecting false
positives (max
6000
for
each layer)
Structure of the Detector
38
layer cascade
6060
features
Layer number
1
2
3 to 4
5 to 38
Number of feautures
2
10
50

Detection rate
100%
100%


Rejection rate
50%
80%


Speed of final Detector
On a
700
Mhz Pentium III processor, the
face detector can process a
384
by
288
pixel image in about .
067
seconds
Improvements
Learning Object Detection from a Small
Number of Examples: the Importance of
Good Features (Levy & Weiss,
2004
)
Improvements
Performance depends crucially on the
features that are used to represent the
objects (Levy & Weiss,
2004
)
Good Features imply:
–
Good results from small training databases
–
Better generalization abilities
–
Shorter (faster) classifiers
Edge Orientation Histogram
Invariant to global illumination changes
Captures geometric properties of faces
Domain knowledge represented:
–
Inner part of the face includes more horizontal edges then vertical
–
The ration between vertical and horizontal edges is bounded
–
The area of the eyes includes mainly horizontal edges
–
The chin has more or less the same number of oblique edges on
both sides
Edge Orientation Histogram
The EOH can be calculated using some kind
of Integral Image:
–
We find the gradients at the point (x,y) using
Sobel masks
–
We calculate the orientation of the edge (x,y)
–
We divide the edges into K bins
–
The result is stored in K matrices
–
We use the same idea of Integral Image for the
matrices
EOH Features
The ratio between two
orientations
The dominance of a given
orientation
Symmetry Features
Results
Already with only
250
positive examples we
can see above
90
% detection rate
Faster classifier
Better performance in profile faces
Demo
Implementing Viola & Jones system
Frank Fritze,
2004
Comments 0
Log in to post a comment