AI and Robotics

Nov 17, 2013

A Robust Real Time Face Detection

Outline

Learning Algorithm

Face Detection in real life

Improvements

Demonstration

A Short Introduction to Boosting (Freund & Schapire, 1999)

Logistic Regression, AdaBoost and Bregman Distances (Collins, Schapire, Singer, 2002)

Boosting

The Horse-Racing Gambler Problem

Rules of thumb for a set of races

How should we choose the set of races in order to get the most useful rules of thumb?

How should the rules be combined into a single
highly accurate prediction rule?

Boosting!

The idea

Initialize sample weights

For each cycle:

  Find a classifier that performs well on the weighted sample

  Increase weights of misclassified examples

Return a weighted list of classifiers

Boosting combines weak classifiers into one strong classifier.

[Figure: example plots with axes labeled shoe size and IQ]

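The algorithm

Given $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$

Initialize $D_1(i) = 1/m$

For $t = 1, \ldots, T$:

  Select the best weak classifier using distribution $D_t$

  Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i]$

  Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$

  Update

  $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \ne y_i \end{cases}$$

  where $Z_t$ is a normalization factor

Output the final hypothesis:

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$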
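The steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the weak learner is assumed to be an exhaustive search over decision stumps (one feature, one threshold, one polarity), and the names `stump_predict`, `best_stump`, `adaboost` and `predict` are hypothetical.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """Weak hypothesis h: returns +1 or -1 from one feature and one threshold."""
    return polarity if x[feature] < threshold else -polarity

def best_stump(X, y, D):
    """Exhaustively pick the stump with the lowest weighted error under D."""
    best = None
    for feature in range(len(X[0])):
        values = sorted({x[feature] for x in X})
        # Candidate thresholds: one below the minimum, then midpoints of neighbours.
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (1, -1):
                err = sum(d for x, yi, d in zip(X, y, D)
                          if stump_predict(x, feature, threshold, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost(X, y, T):
    """Run T rounds; returns a list of (alpha, feature, threshold, polarity)."""
    m = len(X)
    D = [1.0 / m] * m                          # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        eps, feature, threshold, polarity = best_stump(X, y, D)
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard against log(0); an added safeguard
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, feature, threshold, polarity))
        # Multiply by e^{-alpha} when correct (y*h = +1), e^{alpha} when wrong (y*h = -1).
        D = [d * math.exp(-alpha * yi * stump_predict(x, feature, threshold, polarity))
             for x, yi, d in zip(X, y, D)]
        Z = sum(D)                             # normalization factor Z_t
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    """Final hypothesis H(x) = sign(sum_t alpha_t h_t(x))."""
    score = sum(a * stump_predict(x, f, th, p) for a, f, th, p in ensemble)
    return 1 if score >= 0 else -1
```

On the one-dimensional sample `X = [[0], [1], [2], [3], [4], [5]]` with labels `[1, 1, -1, -1, 1, 1]`, no single stump is correct everywhere, but three rounds are enough to classify every training example correctly.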

training error

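Freund and Schapire (1997) proved that:

$$\mathrm{err}'(H) \le \exp\left(-2 \sum_{t=1}^{T} \gamma_t^2\right), \quad \text{where } \gamma_t = \frac{1}{2} - \epsilon_t$$

Here $\mathrm{err}'(H)$ is the training error of the final hypothesis.

AdaBoost adapts to the error rates of the individual weak hypotheses.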
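To get a feeling for how fast this bound shrinks, it can be evaluated numerically. The sketch below is only an illustration: the function name `training_error_bounds` is hypothetical, and the product form $\prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}$ is the intermediate bound from the standard proof, which is not shown on the slide.

```python
import math

def training_error_bounds(epsilons):
    """Return (product bound, exponential bound) for a list of weak-learner errors.

    product bound:     prod_t 2*sqrt(eps_t * (1 - eps_t))
    exponential bound: exp(-2 * sum_t gamma_t^2), with gamma_t = 1/2 - eps_t
    """
    gammas = [0.5 - e for e in epsilons]
    product_bound = math.prod(2 * math.sqrt(e * (1 - e)) for e in epsilons)
    exp_bound = math.exp(-2 * sum(g * g for g in gammas))
    return product_bound, exp_bound

# Ten rounds of weak hypotheses that are each only slightly better than chance.
product_bound, exp_bound = training_error_bounds([0.3] * 10)
```

With ten weak hypotheses of error 0.3 each, the exponential bound is exp(-0.8), so the bound on the training error already drops below one half.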

generalization error

Freund and Schapire (1997) showed that:

$$\mathrm{err}(H) \le \Pr'[H(x) \ne y] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$$

where:

$\Pr'$ - the empirical probability on the training sample

$d$ - VC dimension

$T$ - number of rounds

$m$ - training set size

generalization error

The analysis implies that boosting will overfit
if run for too many rounds

However, it was observed empirically that
AdaBoost often does not overfit, even when run
for thousands of rounds.

Moreover, it was observed that the
generalization error continues to decrease
long after the training error has reached zero

generalization error

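An alternative analysis was presented by Schapire et al. (1998), that suits the empirical findings:

$$\mathrm{err}(H) = \Pr[H(x) \ne y] \le \Pr'[\mathrm{margin}(x, y) \le \theta] + \tilde{O}\left(\sqrt{\frac{d}{m\theta^2}}\right)$$

where $\theta > 0$ and

$$\mathrm{margin}(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t} \in [-1, 1]$$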

different point of view

We try to solve the problem of approximating the y’s using a linear combination of weak hypotheses

In other words, we are interested in the problem of finding a vector of parameters $\alpha$ such that

$$f(x_i) = \sum_{j=1}^{n} \alpha_j h_j(x_i)$$

is a ‘good approximation’ of $y_i$

For classification problems we try to match the sign of $f(x_i)$ to $y_i$

different point of view

Sometimes it is advantageous to minimize some other (non-negative) loss function instead of the number of classification errors

For AdaBoost the loss function is

$$\sum_{i=1}^{m} \exp(-y_i f(x_i))$$

This point of view was used by Collins, Schapire and Singer (2002) to show that AdaBoost converges to optimality
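The relation between the exponential loss and the number of classification errors can be checked on a toy example. This is a small sketch with hypothetical function names; the only fact it relies on is that exp(-y f(x)) is at least 1 whenever the sign of f(x) disagrees with y.

```python
import math

def exponential_loss(ys, fs):
    """Sum over examples of exp(-y_i * f(x_i))."""
    return sum(math.exp(-y * f) for y, f in zip(ys, fs))

def classification_errors(ys, fs):
    """Number of examples where the sign of f(x_i) does not match y_i."""
    return sum(1 for y, f in zip(ys, fs) if y * f <= 0)

ys = [1, -1, 1]          # labels
fs = [2.0, -0.5, -1.0]   # values of f(x_i); the last one has the wrong sign
loss = exponential_loss(ys, fs)
errors = classification_errors(ys, fs)
```

Because every misclassified example contributes at least 1 to the sum, the exponential loss is never smaller than the error count, which is why driving it down also drives the training error down.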

Face Detection

(not face recognition)

Face Detection in Monkeys

There are cells that
‘detect faces’

Face Detection in Humans

There are ‘processes of
face detection’

Faces Are Special

We analyze faces in a
‘different way’

Face Recognition in Humans

We analyze faces ‘in a
specific location’

Robust Real-Time Face Detection

Viola and Jones, 2003

Features

Picture analysis, Integral Image

Features

The system classifies images based on the value of simple features

Two-rectangle

Three-rectangle

Four-rectangle

Value = ∑ (pixels in white area) - ∑ (pixels in black area)

Contrast Features

[Figure: a source image and the result of applying contrast features]

Features

Notice that each feature is related to a special location in the sub-window

Why features and not pixels?

Encode domain knowledge

Feature based system operates faster

Inspiration from human V1

Features

Later we will see that there are other
features that can be used to implement an
efficient face detector

The original system of Viola and Jones used
only rectangle features

Computing Features

Given a detection resolution of 24x24 and an image size of ~200x200, the set of rectangle features is ~160,000!

We need to find a way to rapidly compute
the features
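The order of magnitude of this count can be checked by enumeration. The sketch below assumes five base shapes (two-rectangle and three-rectangle features in both orientations, plus one four-rectangle feature); the shape list and the function name are assumptions, since the slide does not list them.

```python
def count_rectangle_features(window=24):
    """Count every position and scale of each base shape inside a window x window area."""
    # Base shapes as (width, height) in cells: two-rectangle (horizontal, vertical),
    # three-rectangle (horizontal, vertical) and four-rectangle.
    base_shapes = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]
    total = 0
    for base_w, base_h in base_shapes:
        # A feature can be scaled to any multiple of its base shape that still fits.
        for w in range(base_w, window + 1, base_w):
            for h in range(base_h, window + 1, base_h):
                # Number of top-left positions for a w x h feature.
                total += (window - w + 1) * (window - h + 1)
    return total

total_features = count_rectangle_features(24)
```

Under these assumptions the count for a 24x24 window is 162,336, which matches the ~160,000 quoted above and is far more than the 576 pixels in the window.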

Integral Image

Intermediate representation of the image

Computed in one pass over the original image

$$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')$$

$$s(x, y) = s(x, y - 1) + i(x, y)$$

$$ii(x, y) = ii(x - 1, y) + s(x, y)$$

$$s(x, -1) = 0, \quad ii(-1, y) = 0$$

Integral Image

Using the integral image representation one can compute the value of any rectangular sum in constant time.

For example, the sum inside rectangle D can be computed as:

ii(4) + ii(1) - ii(2) - ii(3)

where 1, 2, 3 and 4 are the top-left, top-right, bottom-left and bottom-right corners of D

[Figure: image axes x and y with origin (0,0); the integral image at (x,y) is the sum of the pixels above and to the left of (x,y)]

s(x, y) = s(x, y-1) + i(x, y)

ii(x, y) = ii(x-1, y) + s(x, y)
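The recurrences and the four-lookup rectangle sum translate directly into code. The following is a minimal pure-Python sketch on a list-of-rows image; the function names are hypothetical, and which half of the two-rectangle feature counts as white is an assumption made for illustration.

```python
def integral_image(img):
    """Return ii with ii[y][x] = sum of img[y'][x'] for all y' <= y and x' <= x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for x in range(w):
        s = 0  # cumulative sum s(x, y) along y, with s(x, -1) = 0
        for y in range(h):
            s += img[y][x]                                  # s(x,y) = s(x,y-1) + i(x,y)
            ii[y][x] = (ii[y][x - 1] if x > 0 else 0) + s   # ii(x,y) = ii(x-1,y) + s(x,y)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the inclusive rectangle [x0..x1] x [y0..y1], four lookups."""
    def at(x, y):
        return ii[y][x] if x >= 0 and y >= 0 else 0  # ii(-1, y) = ii(x, -1) = 0
    # ii(4) + ii(1) - ii(2) - ii(3)
    return at(x1, y1) + at(x0 - 1, y0 - 1) - at(x1, y0 - 1) - at(x0 - 1, y1)

def two_rectangle_feature(ii, x, y, w, h):
    """Two side-by-side w x h rectangles: left (white) sum minus right (black) sum."""
    white = rect_sum(ii, x, y, x + w - 1, y + h - 1)
    black = rect_sum(ii, x + w, y, x + 2 * w - 1, y + h - 1)
    return white - black

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
```

Here `rect_sum(ii, 1, 1, 2, 2)` adds up the lower-right 2x2 block (5 + 6 + 8 + 9) without touching the pixels again, and the cost stays the same for a rectangle of any size.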

Integral Image

[Figure: rectangle features evaluated on the integral image at points (x,y); the corner values are combined with weights -1, +1, +2, -1, -2, +1]

Building a Detector

Main Ideas

The Features will be used as weak
classifiers

We will concatenate several detectors

We will boost (using a version of AdaBoost)
a number of features to get ‘good enough’
detectors

Weak Classifiers

Weak Classifier: a feature which best separates the examples

Given a sub-window ($x$), a feature ($f$), a threshold ($\theta$), and a polarity ($p$) indicating the direction of the inequality:

$$h(x, f, p, \theta) = \begin{cases} 1 & \text{if } p f(x) < p \theta \\ 0 & \text{otherwise} \end{cases}$$

Weak Classifiers

A weak classifier is a combination of a feature and a threshold

We have K features

We have N thresholds, where N is the number of examples

Thus there are KN weak classifiers
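A weak classifier of this kind is only a thresholded feature value. As a small sketch (the function name `weak_classify` is hypothetical, and the feature value is assumed to be computed already):

```python
def weak_classify(feature_value, polarity, theta):
    """Return 1 (face) if polarity * f(x) < polarity * theta, else 0 (non-face).

    polarity = +1 accepts feature values below the threshold,
    polarity = -1 accepts feature values above it.
    """
    return 1 if polarity * feature_value < polarity * theta else 0

below = weak_classify(3, 1, 5)    # 3 < 5 with polarity +1
above = weak_classify(7, -1, 5)   # -7 < -5 with polarity -1
```

Flipping the polarity lets the same feature and threshold vote for either side of the split, so each of the KN feature/threshold pairs is tried in both directions.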

Weak Classifier Selection

For each feature, sort the examples based on feature value

For each element, evaluate the total sum of positive/negative example weights (T+/T-) and the sum of positive/negative weights below the current example (S+/S-)

The error for a threshold which splits the range between the current and previous example in the sorted list is

$$e = \min\left(S^+ + (T^- - S^-),\; S^- + (T^+ - S^+)\right)$$

An example

x    y    f   W     T+    T-    S+    S-    A     B     e
X1   -1   2   1/5   3/5   2/5   0     0     2/5   3/5   2/5
X2   -1   3   1/5   3/5   2/5   0     1/5   1/5   4/5   1/5
X3    1   5   1/5   3/5   2/5   0     2/5   0     5/5   0
X4    1   7   1/5   3/5   2/5   1/5   2/5   1/5   4/5   1/5
X5    1   8   1/5   3/5   2/5   2/5   2/5   2/5   3/5   2/5

A = S+ + (T- - S-), B = S- + (T+ - S+), e = min(A, B)
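The table can be reproduced with a single pass over the sorted examples. The sketch below uses exact fractions so the values match the table; the function name `threshold_errors` and the tuple layout are assumptions made for illustration.

```python
from fractions import Fraction

def threshold_errors(examples):
    """examples: list of (feature_value, label, weight) with label in {+1, -1}.

    Returns a list of (feature_value, e) in sorted order, where e is the error of a
    threshold placed just below that example:
        e = min(S+ + (T- - S-), S- + (T+ - S+))
    """
    examples = sorted(examples)
    t_pos = sum(w for _, y, w in examples if y == 1)    # T+
    t_neg = sum(w for _, y, w in examples if y == -1)   # T-
    s_pos = s_neg = 0                                   # S+ and S- below the current example
    rows = []
    for f, y, w in examples:
        a = s_pos + (t_neg - s_neg)   # everything below labeled negative, the rest positive
        b = s_neg + (t_pos - s_pos)   # everything below labeled positive, the rest negative
        rows.append((f, min(a, b)))
        if y == 1:
            s_pos += w
        else:
            s_neg += w
    return rows

w = Fraction(1, 5)
data = [(2, -1, w), (3, -1, w), (5, 1, w), (7, 1, w), (8, 1, w)]
rows = threshold_errors(data)
best_value, best_error = min(rows, key=lambda r: r[1])
```

The minimum error is 0 at the example with feature value 5, so the best threshold for this feature lies between 3 and 5. Sorting once per feature makes the search linear in the number of examples after the sort.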

Simple boosted classifiers can reject many of the negative sub-windows while detecting almost all positive sub-windows

A positive result from the first classifier triggers the evaluation of a second (more complex) classifier, and so on

A negative outcome at any point leads to the immediate rejection of the sub-window
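The control flow of such a cascade is a loop with an early exit. In the sketch below the stages are arbitrary callables ordered from cheapest to most expensive; `cheap_stage`, `costly_stage` and the dictionary keys are hypothetical stand-ins for boosted classifiers, not part of the original system.

```python
def cascade_classify(stages, window):
    """Return True only if every stage accepts the sub-window."""
    for stage in stages:
        if not stage(window):
            return False   # immediate rejection: later stages are never evaluated
    return True

calls = []  # records which stages ran, to make the early exit visible

def cheap_stage(window):
    calls.append("cheap")
    return window["contrast"] > 10

def costly_stage(window):
    calls.append("costly")
    return window["score"] > 0.5
```

For a sub-window such as `{"contrast": 2, "score": 0.9}`, only the cheap stage runs. Since most sub-windows in an image are not faces, most of them are discarded after very little work.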

User selects values for:

  Maximum acceptable false positive rate per layer

  Minimum acceptable detection rate per layer

  Target overall false positive rate

User gives a set of positive and negative examples

While the overall false positive rate is not met:

  While the false positive rate of the current layer is greater than the maximum per layer:

    Increase n and train a classifier with n features using AdaBoost on the set of positive and negative examples

    Decrease the threshold of the current classifier until the detection rate of the layer is at least the minimum

    Evaluate the current cascade classifier on a validation set

  Evaluate the current cascade detector on a set of non-face images and put any false detections into the negative training set
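The three user-selected values are linked: if every layer just meets its per-layer bounds and the layers' rates multiply, the overall rates are products of the per-layer rates. The sketch below rests on that simplifying assumption, and its function names are hypothetical.

```python
def layers_needed(f_max, f_target):
    """Number of layers until f_max ** n drops to f_target or below."""
    n, overall = 0, 1.0
    while overall > f_target:
        overall *= f_max
        n += 1
    return n

def overall_rates(n_layers, f_max, d_min):
    """Overall (false positive rate, detection rate) if every layer sits at its bound."""
    return f_max ** n_layers, d_min ** n_layers

n = layers_needed(0.3, 6e-6)                       # layers each passing 30% of non-faces
false_positive, detection = overall_rates(n, 0.3, 0.99)
```

With a per-layer false positive rate of 0.3, ten layers are enough to fall below 6e-6 overall, while a per-layer detection rate of 0.99 still leaves an overall detection rate of about 0.9. This is why each layer may be quite permissive about false positives but must keep almost all faces.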

Results

Training Data Set

4916 hand-labeled faces

Aligned to base resolution (24x24)

Non-faces for the first layer were collected from 9500 non-face images

Non-faces for subsequent layers were obtained by scanning the partial cascade across non-face images and collecting false positives (max 6000 for each layer)

Structure of the Detector

38 layers

6060 features

Layer number         1      2      3 to 4   5 to 38
Number of features   2      10     50       -
Detection rate       100%   100%   -        -
Rejection rate       50%    80%    -        -

Speed of final Detector

On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds

Improvements

Learning Object Detection from a Small Number of Examples: the Importance of Good Features (Levi & Weiss, 2004)

Improvements

Performance depends crucially on the features that are used to represent the objects (Levi & Weiss, 2004)

Good Features imply:

Good results from small training databases

Better generalization abilities

Shorter (faster) classifiers

Edge Orientation Histogram

Invariant to global illumination changes

Captures geometric properties of faces

Domain knowledge represented:

Inner part of the face includes more horizontal edges than vertical

The ratio between vertical and horizontal edges is bounded

The area of the eyes includes mainly horizontal edges

The chin has more or less the same number of oblique edges on both sides

Edge Orientation Histogram

The EOH can be calculated using some kind of Integral Image:

We find the gradients at the point (x,y) using Sobel masks

We calculate the orientation of the edge at (x,y)

We divide the edges into K bins

The result is stored in K matrices

We use the same idea of Integral Image for the matrices

EOH Features

The ratio between two
orientations

The dominance of a given
orientation

Symmetry Features
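The EOH pipeline and the first two feature types can be sketched as follows. Several choices here are assumptions made for illustration only: central differences stand in for the gradient operator, orientations are taken modulo 180 degrees, a small constant `eps` smooths the ratios, and the region sums are computed directly rather than through integral images to keep the sketch short. All function names are hypothetical.

```python
import math

def eoh_bins(img, k=4, threshold=0.0):
    """Return k matrices; matrix b holds the gradient magnitude of every pixel
    whose gradient orientation falls into bin b, and 0 elsewhere."""
    h, w = len(img), len(img[0])
    bins = [[[0.0] * w for _ in range(h)] for _ in range(k)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # central differences (assumption)
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag <= threshold:
                continue                          # ignore pixels without an edge
            angle = math.atan2(gy, gx) % math.pi  # orientation in [0, pi)
            b = min(int(angle / math.pi * k), k - 1)
            bins[b][y][x] = mag
    return bins

def region_energy(bin_matrix, x0, y0, x1, y1):
    """Sum of one bin matrix over the inclusive rectangle [x0..x1] x [y0..y1]."""
    return sum(bin_matrix[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1))

def orientation_ratio(bins, k1, k2, region, eps=1e-6):
    """Ratio between the energies of two orientations in a region."""
    return (region_energy(bins[k1], *region) + eps) / (region_energy(bins[k2], *region) + eps)

def dominance(bins, k, region, eps=1e-6):
    """Share of one orientation in the total edge energy of a region."""
    total = sum(region_energy(b, *region) for b in bins)
    return (region_energy(bins[k], *region) + eps) / (total + eps)

# A vertical step edge: the intensity changes only along x.
step = [[0, 0, 10, 10] for _ in range(4)]
bins = eoh_bins(step, k=4)
```

For the step image all edge energy lands in bin 0, so `dominance(bins, 0, (0, 0, 3, 3))` is 1. Replacing `region_energy` with a four-lookup sum over an integral image of each bin matrix would give the constant-time evaluation described above.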

Results

With 250 positive examples, the detection rate is above 90%

Faster classifier

Better performance on profile faces

Demo

Implementing Viola & Jones system

Frank Fritze, 2004