PEDESTRIAN DETECTION USING MULTIPLE FEATURES



Proceedings of the 7th Annual ISC Graduate Research Symposium
ISC-GRS 2013
April 24, 2013, Rolla, Missouri


Yungxiang Mao

Department of Computer Science

Missouri University of Science and Technology, Rolla, MO 65409




ABSTRACT

Pedestrian detection has been an active topic in the field of computer vision. Because of variation in pedestrian pose, illumination conditions, and viewpoint, as well as occlusion, robust pedestrian detection remains a challenge for researchers. To better distinguish pedestrians from background, we need effective features to encode the Region Of Interest (ROI) that contains pedestrians. To this end, we propose to combine three features to build our classifier: the Histogram of Oriented Gradient (HOG) feature describes shape information from a single image; the cell-structured Local Binary Pattern (LBP) provides texture continuity information; and the Histogram of Oriented Motion (HOM) feature uses motion information to decide whether a given ROI contains a pedestrian. Experiments on our own airborne live video indicate that the proposed approach has the potential to distinguish between pedestrians and background.

1. INTRODUCTION

Several methods have been proposed for detecting pedestrians in live video robustly and efficiently. Among these approaches there are mainly two philosophies: global descriptors [1-6] and part-based descriptors [7, 8]. Our methodology belongs to the global descriptor class. The available data sets, ETH [9], TUD-Brussels [10], Daimler [11], INRIA [1], and Caltech [13], appear sufficient for both training and testing. The Caltech data set alone contains 350,000 pedestrian bounding boxes labeled in 250,000 frames. Examples are shown in Fig. 1. However, we cannot adopt these data sets for training and testing, for the following reasons: (1) pedestrians in these data sets are all captured in side view, while pedestrians in our airborne detection task show great variation in shape; (2) we need motion information as one of our three features, but the images in these data sets are all still images. Therefore, we build our own eagle-view data set.

As for the learning process, linear Support Vector Machines [14] and boosted classifiers [15] are very popular among current methods because of their good performance. The remaining detection details do not differ much between methods [16]. Therefore, the most important factor affecting pedestrian detection performance is the choice of suitable features. A significant number of features have been explored in the past decade. Dalal and Triggs' HOG [1] brought a large gain in pedestrian detection performance. Inspired by HOG, Wang et al. [3] proposed a feature that combines HOG with cell-structured LBP, together with a partial occlusion handling method, to improve overall detection performance. Motion is also a key cue for human perception. Viola et al. [17] successfully incorporated motion features into detectors, resulting in a large performance gain. Dalal et al. [18] built a motion model based on optical flow differences. Most current methods use a single-scale model to scan input images resized to different scales. Benenson et al. [19] proposed using models at multiple scales to scan a single-scale input image, both to speed up detection and to eliminate the blurring effect of resizing.




(a) Caltech   (b) Caltech-Japan   (c) ETH   (d) our data set

Fig. 1. Example images cropped from pedestrian detection data sets. Pedestrians in the existing data sets (a), (b), (c) are all in side view, whereas airborne videos are in eagle view.

1.1. Contribution

Data Set. Existing popular data sets are all collected in side view. This difference in viewpoint between existing data sets and our videos would harm detection performance. To obtain the best performance, we build our own data set using a self-built quadcopter with a GoPro camera mounted on it. We also collect motion data, which existing popular data sets do not include, for training our motion feature.

Multiple Features. A single feature can detect pedestrians with reasonable performance, but it cannot exceed the performance of multiple features. Inspired by previous work [3] and [18], we propose to build a three-feature classifier. The performance gain is clear in our experiments.

This paper is organized as follows: we introduce our multiple features in Section 2; our own data set and the details involved in detection are introduced in Section 3; in Section 4, we compare the performance of multiple features with that of a single feature alone and of the HOG-LBP feature.

2. MULTIPLE FEATURES

We deal with pedestrian detection in airborne videos, which is more challenging. To achieve satisfactory performance, a new feature needs to be created. In our proposed pedestrian detection procedure, we integrate HOG, cell-structured LBP, and the Histogram of Oriented Motion (HOM) to build a large-scale feature space. The framework of our multiple-feature classifier is shown in Fig. 2. Details of how each feature is built are given in the next three subsections.




































Fig. 2. Procedure of our proposed method. We collect HOG, cell-structured LBP, and HOM features separately to obtain our final large-scale feature vector.

2.1. Histogram of Oriented Gradient

The Histogram of Oriented Gradient (HOG) is an efficient descriptor of object gradients in images and, together with the Support Vector Machine, has been highly popular for classification over the last decade. Briefly, the HOG method calculates the gradient orientation and magnitude at each pixel and votes the orientations into bins within each cell, with each vote weighted by the pixel's gradient magnitude. Each cell histogram is then normalized within each of the four neighboring blocks that the cell belongs to. HOG thus not only addresses the problem of normalizing each cell to remove the effect of illumination, but also spreads the influence of each cell to its neighboring cells.
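The voting and normalization steps described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the 9 unsigned orientation bins over 0 to 180 degrees, the central-difference gradients clamped at the cell border, and the function names `cell_histogram` and `l2_normalize` are all assumptions made for illustration.

```python
import math

def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted orientation histogram of one cell.

    `cell` is a list of rows of gray values. The 9 unsigned bins
    (0-180 degrees) are an assumed, commonly used setting.
    """
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(h):
        for x in range(w):
            # Central differences, clamped at the cell border (assumption).
            gx = cell[y][min(x + 1, w - 1)] - cell[y][max(x - 1, 0)]
            gy = cell[min(y + 1, h - 1)][x] - cell[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The vote is weighted by the gradient magnitude.
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

def l2_normalize(block, eps=1e-6):
    """L2-normalize the concatenated cell histograms of one block."""
    norm = math.sqrt(sum(v * v for v in block) + eps * eps)
    return [v / norm for v in block]

# A vertical edge puts all of its weight into the first orientation bin.
vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
print(cell_histogram(vertical_edge))
```

In a full descriptor, each cell would be normalized once for every block that contains it, and the normalized copies would be concatenated into the feature vector.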

There are two ways to apply the sliding-window technique. The most common is to resize the input frame and then use a single-scale model to scan the resized images. In contrast, [19] suggests that using multi-scale classifiers for pedestrian detection not only overcomes the blurring effect of image resizing but also speeds up the whole procedure. Therefore, in our proposed pedestrian detection system, we mainly follow their idea and build classifiers at 9 scales to scan the input frame.
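As a rough illustration of scanning one un-resized frame with models of several sizes, the sketch below enumerates window positions per model height. The linear spacing of the 9 heights between 28 and 168 pixels, the stride of one quarter of the window height, and the names `model_heights` and `sliding_windows` are assumptions; only the 9 scales, the height range, and the 2:1 height-to-width ratio come from the text.

```python
def model_heights(min_h=28, max_h=168, n_scales=9):
    """Window heights for the scale-specific models (linear spacing is assumed)."""
    step = (max_h - min_h) / (n_scales - 1)
    return [round(min_h + i * step) for i in range(n_scales)]

def sliding_windows(frame_w, frame_h, heights, stride_frac=0.25):
    """Yield (x, y, w, h) windows for every model size on the original frame.

    The frame is never resized; each model scans it with its own window size.
    """
    for h in heights:
        w = h // 2  # 2:1 height-to-width ratio
        stride = max(1, int(h * stride_frac))  # assumed stride
        for y in range(0, frame_h - h + 1, stride):
            for x in range(0, frame_w - w + 1, stride):
                yield (x, y, w, h)

heights = model_heights()
windows = list(sliding_windows(640, 480, heights))
print(len(heights), "model sizes,", len(windows), "candidate windows")
```

Each window would then be passed to the classifier trained for that window size.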



Fig. 3. Visualization of the HOG feature for pedestrian and background.

2.2. Cell-structured Local Binary Pattern


While no single feature performs better than HOG, additional features provide complementary information that improves performance. LBP has been widely used in applications such as face recognition and has achieved good results. Its key advantages are its invariance to monotonic gray-level changes and its computational efficiency, which make it suitable for applications such as pedestrian detection.

Inspired by HOG, Wang et al. [3] proposed using the LBP operator as a descriptor for pedestrians. They add the cell-structured LBP feature as an additional feature vector. It is known that the HOG feature performs poorly when there are noisy edges in the background. LBP can filter out this noise through the concept of uniform patterns [20]. By combining the characteristics of HOG and cell-structured LBP, a descriptor that captures both features of pedestrians performs better than a descriptor with HOG alone.

[Fig. 2 flowchart labels. HOG: current frame, compute image gradients, weighted vote into bins in each cell, normalize over overlapping spatial blocks. LBP: current frame, compute LBP at each pixel, count transition times, vote into bins in each cell, normalize over overlapping spatial blocks. HOM: current frame and consecutive frame, image warping, compute differential frame difference, weighted vote into bins in each cell, normalize over overlapping spatial blocks. Output: collected multiple features.]

Following the procedure for extracting the HOG feature, we first extract the LBP pixel-wise. We use $LBP_{8,1}$ [3] to extract the LBP of each pixel: for one pixel, we use its 8 neighboring pixels within radius 1. A neighboring pixel whose value is greater than or equal to that of the central pixel is written as 1; otherwise it is written as 0. The second step is to count the 0-1 and 1-0 transitions of the LBP. There are 8 bits in one LBP, so the number of transitions varies from 0 to 7, i.e. 8 bins in total. The third step is to vote the transition counts of the pixels within each cell into the 8 bins. Finally, we normalize each cell within the four blocks to which it belongs.
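The first three steps can be sketched in a few lines of pure Python. This is an illustrative sketch only: the clockwise neighbor ordering, the non-circular transition count (chosen because it yields the stated range of 0 to 7), the default cell size of 8 pixels, and the function names are assumptions. Block normalization is omitted.

```python
def lbp_8_1(img, y, x):
    """Return the 8 bits of LBP(8,1) at an interior pixel (y, x).

    A bit is 1 when the neighbor is greater than or equal to the center.
    Neighbors are visited clockwise from the top-left (assumed order).
    """
    center = img[y][x]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    return [1 if img[y + dy][x + dx] >= center else 0 for dy, dx in offsets]

def count_transitions(bits):
    """Count 0-1 and 1-0 changes between consecutive bits (0 to 7 for 8 bits)."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)

def cell_lbp_histogram(img, y0, x0, cell=8):
    """Vote the transition count of every pixel in one cell into 8 bins.

    The cell must not touch the image border, since each pixel needs
    all 8 of its neighbors.
    """
    hist = [0] * 8
    for y in range(y0, y0 + cell):
        for x in range(x0, x0 + cell):
            hist[count_transitions(lbp_8_1(img, y, x))] += 1
    return hist

patch = [[9, 9, 9],
         [1, 5, 1],
         [1, 1, 1]]
bits = lbp_8_1(patch, 1, 1)
print(bits, count_transitions(bits))
```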






Fig. 4. Example of $LBP_{8,1}$ feature extraction.


2.3. Histogram of Oriented Motion

With the combination of HOG and cell-structured LBP, pedestrian detection achieves better performance. However, another notable feature that is highly useful for distinguishing pedestrians from background is the motion feature.

One might argue that, since HOG already captures the boundary information of a pedestrian, there is no need to build HOM. We nevertheless do so. Admittedly, motion appears along the boundary of a pedestrian, but the boundaries of the background disappear in HOM while they still exist in HOG. Figure () shows the difference between HOG and HOM.

Since the videos are collected by moving cameras mounted on quadcopters, we must first perform video stabilization to obtain the real motion of pedestrians. After stabilization, the background in consecutive frames appears stable, as it would with a stationary camera. The interval between the two frames used to estimate the homography should not be large; otherwise the background noise becomes large, which makes background and pedestrians harder to distinguish.


Also inspired by the HOG feature, the Histogram of Oriented Motion (HOM) differs slightly from the motion descriptors introduced in [18]. Instead of using optical flow as the input of HOM, we directly use the frame difference. Denote the frame difference as $I_c$ and its x- and y-derivatives as $I_{cx}$ and $I_{cy}$. We follow the procedure of HOG to build HOM. One thing to note is that the motion of the background is much smaller than that of pedestrians. If we used the same normalization step as HOG directly, we would not make good use of this information to make pedestrians and background more distinguishable. Therefore, to preserve the difference between pedestrian motion and background motion, we apply a threshold to filter out low-valued motion before performing the same normalization as in HOG.
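A minimal sketch of this idea follows, assuming the two frames have already been stabilized. Where exactly the threshold is applied is not specified in the text; here it is applied to the magnitude of the derivatives of the frame difference, which is an assumption, as are the threshold value, the 9 bins, and the function names.

```python
import math

def frame_difference(current, consecutive):
    """Pixel-wise difference I_c between two stabilized gray frames."""
    return [[b - a for a, b in zip(row0, row1)]
            for row0, row1 in zip(current, consecutive)]

def hom_cell_histogram(diff, n_bins=9, threshold=5.0):
    """HOG-style histogram on the frame difference, with low motion removed.

    Votes whose magnitude is below `threshold` (an assumed value) are
    dropped before any normalization, so weak background motion is not
    amplified later.
    """
    h, w = len(diff), len(diff[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(h):
        for x in range(w):
            # I_cx and I_cy: derivatives of the frame difference.
            icx = diff[y][min(x + 1, w - 1)] - diff[y][max(x - 1, 0)]
            icy = diff[min(y + 1, h - 1)][x] - diff[max(y - 1, 0)][x]
            magnitude = math.hypot(icx, icy)
            if magnitude < threshold:
                continue  # filter low-valued motion
            angle = math.degrees(math.atan2(icy, icx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

still = [[0] * 4 for _ in range(4)]
moved = [[0, 0, 50, 50] for _ in range(4)]
print(hom_cell_histogram(frame_difference(still, moved)))
```

With the threshold in place, small residual differences left by imperfect stabilization contribute nothing, while strong motion keeps its full weight.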




Fig. 5. Comparison of the HOG feature and the motion input. The first column shows the cropped background and pedestrian training samples. The second column shows the visualization of the HOG feature for background and pedestrian. The third column shows the motion of background and pedestrian without threshold filtering. The last column shows the motion of background and pedestrian after threshold filtering. To human perception, the HOG features of pedestrian and background do not differ much, but the motion features do.

3. TRAINING DATA SET AND SOME DETAILS

We use our self-built quadcopter to collect videos for building our training data set. The data set consists of 5 videos that last more than 10 minutes in total. Because of the high-frequency vibration of the quadcopter, videos taken by common cameras tend to be heavily blurred, which decreases detection performance. To reduce the blurring, a GoPro camera is used, since it provides good video stabilization quality. When cropping pedestrians from the training data, we fix the height-to-width ratio of the windows at 2:1. We divide the positive training set into 9 scales. The height of the positive training samples ranges from 28 pixels to 168 pixels.


Fig. 6. The distribution of the height of positive training samples.





For the cell size and number of orientations of HOG and the cost C in the SVM, we simply follow the original paper's suggestions. For the cell size and number of orientations of HOM, we use the same parameters as for HOG. The linear SVM is popular for pedestrian detection because of its low computational cost. We use SVM Lite for this task.


We start building the classifier at each scale with more than 2386 positive training samples from 40 pedestrians, along with their flipped images, and 150 randomly selected negative training samples. Their motion samples are collected at the same time. We run the initial 9 classifiers on our training videos and add the distinct false positives to the negative training set. A crucial point in training that is often underestimated is that, when the training set is not large enough, one specific pedestrian in the same posture should not be added to the positive training set repeatedly. If training samples of one pedestrian in similar postures are added too many times, the weight of that posture increases and the overall detection performance decreases. The same applies to the negative training set.

Motion information not only works as an effective feature but also provides a strong cue for where to place the scanning window. However, because of the vibration and mobility of the quadcopter, image stabilization must be performed first to obtain motion information. We then enlarge the detection area around each motion point and perform uniform sampling to cut down the detection cost.
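One way to picture this motion-guided search is the sketch below, which samples window positions on a uniform grid inside an enlarged area around each motion point instead of scanning the whole frame. The margin, the sampling step, and the name `candidate_windows` are hypothetical; the text does not state the values actually used.

```python
def candidate_windows(motion_points, win_w, win_h, frame_w, frame_h,
                      margin=16, step=8):
    """Return windows whose centers lie on a uniform grid around motion points.

    Each motion point (x, y) is enlarged to a square search area of
    +/- `margin` pixels; `margin` and `step` are assumed values.
    Windows that leave the frame are dropped, and duplicates produced by
    nearby motion points are merged.
    """
    found = set()
    for mx, my in motion_points:
        for cy in range(my - margin, my + margin + 1, step):
            for cx in range(mx - margin, mx + margin + 1, step):
                x, y = cx - win_w // 2, cy - win_h // 2
                if 0 <= x <= frame_w - win_w and 0 <= y <= frame_h - win_h:
                    found.add((x, y, win_w, win_h))
    return sorted(found)

windows = candidate_windows([(50, 50)], 14, 28, 100, 100, margin=8, step=8)
print(len(windows), "windows to classify")
```

For a single motion point this leaves 9 windows to classify, far fewer than an exhaustive scan of the same frame.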

4. EVALUATION

Currently popular pedestrian detection data sets are not suitable for our test, both because of the different viewpoint, which would harm performance, and because they lack motion information for training and testing. Our training and testing are therefore performed on our own data set.

In the evaluation, we test three kinds of features: HOG alone, the combination of HOG and LBP, and our HOG-LBP-HOM feature. We adopt two measures to compare these three features: recall and precision.

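Denote by $BB_{dt}$ the detected bounding box and by $BB_{gt}$ the ground truth bounding box. Recall and precision are defined as follows, where each term counts bounding boxes:

$$\text{recall} = \frac{\text{successfully detected } BB}{BB_{gt}} \qquad (1)$$

$$\text{precision} = \frac{\text{successfully detected } BB}{\text{all detected } BB} \qquad (2)$$

We say that a pedestrian is detected when the following condition is satisfied:

$$a_0 = \frac{\text{area}(BB_{dt} \cap BB_{gt})}{\text{area}(BB_{dt} \cup BB_{gt})} > 0.5 \qquad (3)$$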
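The overlap criterion and the two measures can be written out as a short sketch. Boxes are assumed to be (x1, y1, x2, y2) tuples, and each ground truth box is matched to at most one detection in a simple greedy pass; both choices, and the function names, are assumptions rather than details given in the text.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def overlap_ratio(bb_dt, bb_gt):
    """Intersection area divided by union area of two boxes."""
    inter_box = (max(bb_dt[0], bb_gt[0]), max(bb_dt[1], bb_gt[1]),
                 min(bb_dt[2], bb_gt[2]), min(bb_dt[3], bb_gt[3]))
    inter = area(inter_box)
    union = area(bb_dt) + area(bb_gt) - inter
    return inter / union if union > 0 else 0.0

def recall_precision(detections, ground_truths, min_overlap=0.5):
    """Recall and precision with greedy one-to-one matching (assumed)."""
    matched = set()
    hits = 0
    for dt in detections:
        for i, gt in enumerate(ground_truths):
            if i not in matched and overlap_ratio(dt, gt) > min_overlap:
                matched.add(i)
                hits += 1
                break
    recall = hits / len(ground_truths) if ground_truths else 0.0
    precision = hits / len(detections) if detections else 0.0
    return recall, precision

gts = [(0, 0, 10, 20), (50, 50, 60, 70)]
dts = [(1, 0, 11, 20), (100, 100, 110, 120), (200, 0, 210, 20)]
print(recall_precision(dts, gts))
```

In this toy example one of two pedestrians is found and one of three detections is correct, so recall is 0.5 and precision is one third.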


Because detection is time-consuming, we evaluate these three features on 102 pedestrians. These 102 pedestrians appear at different scales, and their viewpoints are not all the same. The experimental results are shown in Table 1.



             HOG      HOG-LBP    HOG-LBP-HOM
recall       60.1%    72.3%      76.5%
precision    58.5%    65.5%      72.3%

Table 1. Comparison of three features by the measurement of recall and precision.

As shown in Table 1, our multiple-feature classifier outperforms HOG and HOG-LBP by 27% and 5% in recall and by 23.5% and 10.4% in precision, respectively, in relative terms.
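The quoted percentages are consistent with relative improvements computed from the values in Table 1, as the short check below shows. The helper name `relative_gain` is introduced only for this illustration.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# Values copied from Table 1 (percent).
recall = {"HOG": 60.1, "HOG-LBP": 72.3, "HOG-LBP-HOM": 76.5}
precision = {"HOG": 58.5, "HOG-LBP": 65.5, "HOG-LBP-HOM": 72.3}

for name, table in (("recall", recall), ("precision", precision)):
    for baseline in ("HOG", "HOG-LBP"):
        gain = relative_gain(table["HOG-LBP-HOM"], table[baseline])
        print(f"{name} gain over {baseline}: {gain:.1f}%")
```

This gives about 27.3% and 5.8% for recall and 23.6% and 10.4% for precision, close to the figures quoted in the text. The absolute differences are 16.4 and 4.2 percentage points in recall and 13.8 and 6.8 in precision.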

Some of our test images are shown in Fig. 7. More test results using the multiple-feature classifier are shown in Fig. 8.





Fig. 7. Test results of the HOG, HOG-LBP, and HOG-LBP-HOM classifiers. (a), (b) are the results of the HOG classifier. (c), (d) are the results of the HOG-LBP classifier. (e), (f) are the results of our HOG-LBP-HOM classifier.




5. CONCLUSIONS



In this paper, we propose a three-feature classifier for pedestrian detection. With the addition of the Histogram of Oriented Motion, our multi-feature detector performs better than the HOG-alone and HOG-LBP detectors. For airborne pedestrian detection, we also build our own data set for both training and testing. Motion information is included in the data set so that the motion feature can be used.


One drawback of our work is that extracting three features and using an SVM for classification can be time-consuming. “Feature mining” has been proposed by Dollár et al. [21] to efficiently utilize very large feature spaces using various strategies. In the future, we will try to apply their work, together with parallel computing technology, to ours to reach a higher computing speed.

6. ACKNOWLEDGMENTS

The authors would like to acknowledge the great support of the Intelligent Systems Center.

7. REFERENCES

[1] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.

[2] C. Wojek and B. Schiele, “A Performance Evaluation of Single and Multi-Feature People Detection,” Proc. DAGM Symp. Pattern Recognition, 2008.

[3] X. Wang, T.X. Han, and S. Yan, “An HOG-LBP Human Detector with Partial Occlusion Handling,” Proc. IEEE Int’l Conf. Computer Vision, 2009.

[4] S. Walk, N. Majer, K. Schindler, and B. Schiele, “New Features and Insights for Pedestrian Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.

[5] C. Wojek and B. Schiele, “A Performance Evaluation of Single and Multi-Feature People Detection,” Proc. DAGM Symp. Pattern Recognition, 2008.

[6] P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral Channel Features,” Proc. British Machine Vision Conf., 2009.

[7] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A Discriminatively Trained, Multiscale, Deformable Part Model,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.

[8] C. Wojek and B. Schiele, “A Performance Evaluation of Single and Multi-Feature People Detection,” Proc. DAGM Symp. Pattern Recognition, 2008.

[9] A. Ess, B. Leibe, and L. Van Gool, “Depth and Appearance for Mobile Scene Analysis,” Proc. IEEE Int’l Conf. Computer Vision, 2007.

[10] C. Wojek, S. Walk, and B. Schiele, “Multi-Cue Onboard Pedestrian Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.

[11] M. Enzweiler and D.M. Gavrila, “Monocular Pedestrian Detection: Survey and Experiments,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2179-2195, Dec. 2009.

[12] C. Wojek and B. Schiele, “A Performance Evaluation of Single and Multi-Feature People Detection,” Proc. DAGM Symp. Pattern Recognition, 2008.

[13] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian Detection: A Benchmark,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.

[14] C. Papageorgiou and T. Poggio, “A Trainable System for Object Detection,” Int’l J. Computer Vision, vol. 38, no. 1, pp. 15-33, 2000.

[15] S. Walk, K. Schindler, and B. Schiele, “Disparity Statistics for Pedestrian Detection: Combining Appearance, Motion and Stereo,” Proc. European Conf. Computer Vision, 2010.

[16] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian Detection: An Evaluation of the State of the Art,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743-761, Apr. 2012.

[17] P.A. Viola, M.J. Jones, and D. Snow, “Detecting Pedestrians Using Patterns of Motion and Appearance,” Int’l J. Computer Vision, vol. 63, no. 2, pp. 153-161, 2005.

[18] N. Dalal, B. Triggs, and C. Schmid, “Human Detection Using Oriented Histograms of Flow and Appearance,” Proc. European Conf. Computer Vision, 2006.

[19] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, “Pedestrian Detection at 100 Frames per Second,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012.

[20] T. Ojala, M. Pietikäinen, and D. Harwood, “A Comparative Study of Texture Measures with Classification Based on Feature Distributions,” Pattern Recognition, vol. 29, no. 1, pp. 51-59, 1996.

[21] P. Dollár, Z. Tu, H. Tao, and S. Belongie, “Feature Mining for Image Classification,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.


















Fig. 8. HOG-LBP-HOM classifier test results.