A Machine Learning Approach to Object Recognition in the Context of Visual Road Scene Analysis from a Moving Vehicle

spraytownspeakerΤεχνίτη Νοημοσύνη και Ρομποτική

16 Οκτ 2013 (πριν από 3 χρόνια και 9 μήνες)

65 εμφανίσεις

A Machine Learning Approach to Object Recognition in the Context of
Visual
Road Scene Analysis from a Moving Vehicle


Jivko Sinapov

Dept. of Computer Science

Iowa State University



I
. Introduction



This project address an object recognition task related
to

visual

road scene
analysis in the context of a moving vehicle. The goal is to implement and evaluate a
robust object recognition technique for detection of various objects of interest. Object
recognition tasks such as detecting other cars and traffic si
gns are very important when
designing driving assistive systems or autonomous driving agents. In this project we
implement and evaluate an object detection scheme utilizing a cascade of Haar feature
classifiers, as well as a boo
sting technique utilizing SV
M.




II
. Background and Motivation



The success of several of the challengers in last year’s DARPA Grand Challenge
shows that computer vision can be used effectively in solving many of the problems
associated with autonomous driving. Many problems remain

unsolved, however. For
example, the computers that processed the data in the autonomous vehicles participating
in the challenge are far superior to the average PC and it is unlikely that people at large
would be able to outfit theirs cars with such system
s.
In the coming years we are likely to
see various driving assistive technologies appear on the market and t
here is currently a
large overlap between the set of problems associated with autonomous driving and that of
problems associated in the area of dri
ving assistive technology. A large fraction of
accidents occurs because the driver is not paying attention to the road and cars in front of
their vehicle. For example, a driver not paying attention can easily veer off course and
enter an undesired lane, or

fail to stop at a traffic light. As such, real time traffic light, and
vehicle detection are very appropriate problems to tackle since any driving assistive or
autonomous driving system would have to be able to perform those tasks.


The primary goal for
this project is to provide appropriate solutions for these
problems which can work in real time on a regular PC. In particular, we’ll take a look at
the problems of detecting traffic lights and other cars in the field of view. It is
conceivable that in the

near future cars would come equipped with systems which
monitor the road, as well as the driver in order to determine if he or she is not paying
attention to the road. In such cases, the system must be able to detect situations which
demand the driver’s i
mmediate attention


for example, if the car is approaching a red
light at high speed or if the car in front is suddenly slowing down. In order for such a
system to work, it will need to be able to accurately detect the objects of interest and a
machine l
earning approach is likely to provide such a solution.




III
. Object Recognition Using a Haar Cascade Classifier


In the task of object recognition, we implement an approach which classifies
objects based on an extended set of Haar features. This approac
hed was originall
y
proposed by Viola and Jones [1
] and extended by Lienhart [
2
].


The detection scheme uses the values of Haar
-
like

features in an image in order to
classify an object as a positive or negative instance.
A subset of simple features used in

this model is

shown in Figure 1.








Figure 1: Some simple examples of Haar
-
based features



Each of these features consists of a geometric representation of two regions


black and white. The value of each fea
ture

at a given position in the image

is the
difference between the sum
s

of the pixels within the two regions.
Haar features can take
arbitrarily complex shapes and the

size of the

full set available in this model is in the
order of tens of thousands.
In o
rder to compute the value

of each feature at a given
location of the image, the image is represented in an integral form: the value at position
(x, y) in the integral image will contain the sum of pixels that are above y and to the left
of x. The general f
ormula for the integral image representation is the following:



The integral image representation is chosen for several reasons.
First
,

it allows for
efficient computation of a given Haar feature at a given position of the test imag
e. In
addition, i
t allows for robust object detection regardless of global lightning conditions,
since the Haar features take into account only the differences of sums of pixels, which are
invariant in terms of the global intensity of the image.
Last but n
ot least,
the integral
image representation allows for the
object detection algorithm

to scan for objects at
different scales very efficiently since scaling the integral image can be done much faster
than scaling the RGB image

[1]
. This is a very desirable

property since real
-
time usability
is a major goal for

this object recognition system.


The classifier is built in stages


at each stage, an AdaBoost
-
like approach is
applied to selecting one or more Haar
-
features, as well as determining appropriate
thr
esholds which can be applied to reject a large number of negative training instances.
Important
input
parameters

for the training procedure

are the
minimum hit ratio

and
maximum false alarm rate



the search for optimal feature and threshold selection will

continue until those two requirements are met, at which point the remaining training
examples will be passed on to the next stage.

For example, if those parameters are set to
0.995 and 0.5 respectively, at each stage, feature selection and threshold optim
ization will
be applied until the resulting stage is
capable of classifying 99.5% of the
positive instances as positive and does not
classify more than 50% of the negative
images as positive. For more
precise
details regarding feature selection and
trainin
g, consult Viola and Jones [1].



In the extended model of the
classifier implemented in the OpenCV
C++ library, each stage of the classifier
can make use of more than one feature in
order to meet the requirements set by the
input parameters
, in which cas
e each stage
can be viewed as a

decision

tree, rather
than a

decision

stump

[3]
.
It is also
important to note that at each stage, the
classifier uses a different set of negative
training images which
are
sample
d

from a
given database of images that do not
contain the specified object.
After training the desired number of stages, t
he result is a
cascade of tree
-
like classifiers
, as show in Figure 2.






Figure 2: Schematic description of the classifier. At
each stage, the classifier either rejects the instance
(repres
ented as a sub
-
window from a given test image)
based on a given feature value or sends the instance
further down the tree for more processing. At the
initial stages a large number of negative examples are
eliminated [1].

The structure of the resulting classifier is essentially that of a degenerate decision
tr
ee or a decision list. Each added stage to the classifier tends to reduce the false positive
rate, but also reduces the detection rate

[3]
. As such, it is essential to train the classifier
with the appropriate number of stages for the
given
task.


Once a
classifier is trained, detection is done by sliding a window across an input
image and passing the cropped
sub
-
image

through the classifier. In order for
classification to be size
-
invariant, the same procedure is also performed on the input

integral image
at different scales.
Given this
scheme
, the output of classification is a
series of sub
-
windows of the test image which contain the desired object.
In the following
two sections we outline how this model was applied to the problems of traffic light and
car

detection.





IV
. Traffic Light Detection



The problem of traffic light detection is important in the area of driving assistive
technology. A system which is tasked with preventing accidents when a driver is not
paying attention must always know whethe
r there is a traffic light in the scene and what
its state is.



To
solve

this
task
, a Haar
c
ascade
classifier for traffic lights

was trained
. The
dataset used in these experiments consists of
real time video taken from a camcorder
position
ed

inside
a pa
ssenger car.
The
camera resolution is low (320 by 240) and so
is the image quality, thus adding another
challenge to this problem.


The classifier was trained with 5 stages on 120 positive examples, and 120 negative
examples.
The
minimum hit ratio

at each

stage is set
to
0.95 and the
maximum false
alarm rate

is set to 30%.



To improve results and decrease computation time, the area of the image being
searched through

is restricted to the portion where a tra
ffic light could actually occur


there
is no po
int at looking for that object o
n the
actual
road, for example.

Once a traffic
light is detected in the input stream, the image is analyzed to determine
its

state. Ideally,
we would want to identify the area of the traffic light which contains the actual c
olor
signal. In our case, however, the resolution was low enough such that the number of
pixels that actually correspond to the light in the traffic light is usually about 5 or 6 which
makes it quite difficult to analyze. Nevertheless, a simple scheme for
determining the
color of the light
is implemented
which works the following way:





Figure 3: Positive examples of
traffic lights used
for training
. Negative samples are randomly
selected from an image collection that does
include traffic lights.




Input:

cropped image of detected traffic light


Steps:

1. G = Sum the green components of the cropped image



2. R = Sum of the red components



3. If (G/R) < t
1
, then o
utput RED



4. If (G/R) > t
2
, then output GREEN



5. Else, output YELLOW



The thresholds t
1

and t
2

were automatically
optimized

based on a training set of
traffic light images. In practice this scheme worked well in determining the color of a
given light,

although if we had better resolution, it is conceivable that a much
better and
more robust
algorithm could be devised.



The classifier was evaluated on about 20 minutes of continuous input stream

recorded while driving in Ames, IA
.
The detection scheme
works quite comfortably in
real time due to the small size of the trained classifier and small area of the image that is
being
processed
.
In all the occasions on which a traffic light was passed, the classifier is
able to detect it and

almost always
output
s

the correct
color, so long it
is

green or red.
The lo
w
-
quality of the video input mak
e
s

it difficult to recognize yellow since the pixels
of the light signal actually assume white color in such cases.
Th
e good detection
performance

is likely due to the f
act the traffic light shape is very distinct and there are
almost no other objects present in the portion of the image that is being searched.
One
obvious drawback is that only traffic lights of this particular shape can be
detected


while most traffic li
ghts in Ames follow this standard, the same might not be true for
other cities.
Figure 4

shows some example results.
At the end of this paper there is a
discussion about some available online demos of this system
that

demonstrate

how it
works in practice.








Fig.4
. Example results from running the traffic light detection procedure





V
. Car Detection



In this particular problem, we are interested in detecting vehic
les in front of the
observer. A series of Haar cascade classifiers are trained and evaluated on two different
datasets.



The first dataset, as in the previous problem, consists of low
-
quality video taken
while driving in Ames and the surrounding area
s. T
he low quality, however, mak
e
s

it
difficult to detect objects further in the distance and as such a second dataset of good
quality images was used in order to evaluate the detection scheme in more detail.



5.1. Car Detection in low
-
quality and low
-
resolu
tion video stream



As in the previous section,
t
he experiments are

performed on a dataset comprising
of a recorded video from a camera installed
in a passenger car while driving in Ames. A
classifier with 10 stages is trained on
200
sample images of cars
taken from half the
amount of video available. The training parameters
minimum hit ratio

and
maximum false
alarm rate

are set to 0.995 and 0.3 respectively. The resulting classifier is tested on about
20 minutes of video recorded while driving on the freew
ay.


Once again, since the position of the
road relative to the observer is known in this
context, we are able to restrict the image
area in which a car is hypothesized to be.
Restricting the region of interest allows for
greater speed of computation and

for
elimination of false positives which could
not possibly be actual cars due to their
location.


Once the region of interest is
identified, it is scanned

by a widnow

at
different scales
, and any sub
-
windows which
are marked as positive by the Haar casc
ade
classifier are deemed to be detected cars.


Restricting the search area helps eliminate almost all false positives. A passing car
was always detected as such, although once it gets far ahead enough, the detection
scheme fails due to the small size
of
the object
and low
image
quality. Even though large
semi
-
trucks were not part of the training set, they generally tended to be recognized as
cars by the classifier
,

if close enough. The demos available
online

can give an accurate
illustration of how well t
his detection and classification scheme works.

Some sample
screenshots are included in Appendix I.

Overall, with a large data set and good quality
video stream, such system could be fairly robust although it will never be absolutely
perfect and hence an au
tonomous driving agent would need
a
much smarter framework in

Figure 7: Identifying regio
n of interest, and
performing detection with trained Haar
cascade classifier

order to detect vehicles on the road.

In
the next section, we evaluate this object
recognition and detection scheme much more precisely with mid
-

to good
-

quality input
data.





5.
2
. Car Detec
tion in mid
-

to good
-
quality images



The dataset used in the following
experiments consist of 526 images taken from
inside the driver seat of a vehicle, each of which
contains at least one car in front of the observer.
The images are not sequential fram
es from a video
feed.
Sample images from this dataset are shown
in Figure 5.
The dataset was split into 2/3 training
and 1/3 test sets. Overall, 300 sample images of
cars were extracted which were used for training
each classifier.




Knowing that detecti
on rate can decrease
as the number of stages in a classifier increase, our
task is to determine the optimal number of stages
for this given problem. The training parameters
minimum hit ratio

and
maximum false alarm rate

are set to 0.995 and 0.3 respectivel
y for all trained
classifiers. Following, classifier with number of
stages ranging from 5 to 10 are trained and
evaluated on the test set.



Evaluat
ion

is performed by running the detection scheme on the test set and
taking note of the type of results tha
t are outputted at each frame. Each

output

result falls
within one of three categories:
positive
,
negative
, or
partial
. Positive results are tho
se that
contain a car in a well
-
defined box. Negative results are such outputs that do not contain
any major dis
tinguishable portion of a car. Partial results contain everything in between


if the result contains a major portion of the car, or if it contains a car, but also lots of
other stuff, then it is labeled as partial.
Figure 6 shows examples of each type of
outputs.








(a)




(b)



(c)


Figure
6: Examples of a positive (a), negative (b) and a partial (c) detected object.



Figure
5: Samples from a car image
database.



Each trained classifier was tasked with detecting cars in the test set and
the
res
ulting outputs were
saved and
manually labeled as positive, negative or partial. Figure
7 shows the results of each run. As we can see from the chart, the 7
-
stage classifier
detects the highest number of cars in the test set, while the 10
-
stage

classifier
detects the
lowest false alarm rate
, as expected.



Figure 7: Summary of classifiers’ performance
.




The results of these experiments

illustrate

the tradeoff between the hit rate and the
false alarm rate of each classifier. Ideally, we want to detect a
s many
actual

appearances
of the target object in the input stream without reporting too many false positives.
In
both, driving assistive technology and autonomous driving applications, a false positive
error is not nearly as bad as a complete miss of an a
ctual object of interest.



Following, we explore an approach to boost the classifier in order to minimize the
false alarm rate whi
le maintaining a good hit ratio
. One such approach would be to
reinsert samples of false positive outputs into the training
set and further train the Haar
cascade classifier. Retraining the classifier, however, is
a
highly time
-
consuming
process
when compared to other machine learning techniques.
A 10
-
stage Haar cascade classifier,
for example, can take up to one hour train on
an average PC, even when faced with only a
small dataset of 300 positive
and 300 negative
samples.
If a real
-
time system is being told
by the user that some of its findings are false positives, it would not have the luxury of
time to adapt to those results
.



Our

approach to improving performance in real time utilizes an SVM which is
trained on labeled detected outputs resulting from running the Haar cascade classifier
detecti
on scheme. This technique proves

time
-
efficient and it improve
s

performance. A
go
od question at this point is why not use SVM from the very beginning?
An SVM
approach would likely yield better results than Haar cascade classification. However, we
note that it is difficult to efficiently search through an RGB image at different scales f
or
potential candidates.

If the SVM makes use of global and local features instead of the raw
pixel values, then there would be even extra computational overhead (in addition to
scaling the image) when sliding a window and looking for a match.

Real
-
time us
ability is
a requirement for any driving assistive technology or autonomous driving system. We
also have to note that object detection and recognition is only a small portion of such a
system and as such, we need an efficient algorithm which saves computat
ional resources
for other tasks such as object tracking and decision making.



W
e perform
experiment
s

to initially validate whether an SVM can be used to
distinguish between
positive

and negative results of the Haar cascade classifier and
determine what
image representation is best to use. The
set of 232

positive and
194
negative output samples from the 5
-
stage classifier
is

used as a dataset in this experiment.
Each sample is scaled to s
ize 15 by 15, converted to gray
-
sc
ale image, and un
dergoes
histogram

equalization. The equalized gray
-
scale image is used as the raw input, which
each attribute corresponding to a particular pixel with value of 0 to 255 scaled to a real
value between 0 and 1. We also perform
an experiment to see whether it is better
to use

the edges in the image as a
representation of the instances, rather
than the gray
-
scale image itself. Overall,
there are 225 attributes per instance (1
for each pixel), regardless of which
representation we use. Figure 8
illustrates

the way

instances are
preprocessed for input into the SVM
algorithm.




The experiment suggests that using the equalized gray
-
scale representation yields
better classification results. Using 5
-
fold cross
-
validation with a polynomial kernel SVM,
we can achieve 93
.8
% accuracy wh
ich is
illustrated in

the following confusion matrix:









predicated

positive

negative

23
7

5

positive

actual

16

178

negative







(a) (b)

(c)


Figure
5: Samples’ preprocessing: (a) gray scale, (b)
histogram equalization, (c) Canny edge detection.


Using the detected edges representation of the samples, on the other hand, yielded
accuracy of only 83%. Experiments were

also performed to determine the optimal scale
of the samples
and the results show that increasing the image dimensions beyond 15 by
15 does not produce a significant increase in accuracy, but as expected, slows down
training and testing due to the quadrat
ic increase of the number of attributes.



Following this
result
, we attempt to boost the 7
-
stage classifier by training and
evaluating an SVM on the dataset comprised of the Haar cascade classifier’s output
results.
The dataset contains 512 instances, of

which 250 are positive, 221 are partial, and
41 are negative.
We

conduct a 5
-
fold cross
-
validation
experiment with a multi
-
class SVM

with polynomial kernel of 4
th

degree and t
he result is the following confusion matrix:











All positive instances get classified as either positive or partial, while only a small
fraction of negative and partial instances get
s c
lassified as positive.
No positive instanc
e
is classified as negative, which is a highly desirable property in the applications discussed
previously.
The results are promising and show that boosting a Haar cascade classifier
with an SVM can increase performance. The boosted 7
-
stage Haar cascade cl
assifier is
clearly

superior to the 10
-
stage classifier in terms of quality of results.




VI.
Discussion



We have shown that a machine learning approach utilizing
a Haar cascade
classifier and an SVM can be an efficient and accurate method for performin
g object
detection in real time input video stream. While the detection rate achieved is not high
enough for an autonomous driving agent, the proposed
scheme

could be utilized within a
driving assistive technology
system
.
For demos of the currently develop
ed framework,
visit:

http://www.cs.iastate.edu/~jsinapov/Vision/



The question of whether boosting a Haar cascade classifier with an SVM is more
efficient than using SVM for detection itself still remains to be answered. Intuitively,
searching for an obj
ect within the image would be faster if using a Haar cascade classifier

for recognition
, but this hypothesis is yet to be validated.
Object recognition with SVM
and local features (such as SIFT features, for example) has been shown to have very high
perfor
mance, but it is still a question of whether localizing the target object in an input
image can be done efficiently if the features used for recognition are not easy to compute
[4].

predicted

positive

negative

partia
l

240

0

10

positive

actual

3

24

14

negative

26

3

192

partial


Ultimately, the goal is to design a system which can efficiently detect
an object in
the input video stream, as well as efficiently update its model of the target to be detected.
SVM is the most likely candidate to achieve this task,
as long as

we can implement an
efficient search

routine

through the input image. From this sta
ndpoint, we can view the
Haar cascade object detection scheme as a search technique which identifies the areas
most likely to contain the target we are looking for.
Once those candidates are localized,
they can be passed on to a stronger classifier which w
ould not only produce better results,
but also be able to adapt its model based on user feedback.
An alternative approach
would be to

still

use AdaBoost for feature selection during training but utilize SVM
directly instead of
constructing
a cascade of dec
ision trees. This has the potential to
combine the efficiency of Haar cascade classifier detection scheme with the
classification
power
and robustness
of the SVM.





References

[1]

Viola, P., and Jones, M., “Rapid Object Detection using a Boosted Cas
cade of
Simple Features,”
IEEE CVPR
, 2001.

[2]

Lienhart, R., and Maydt, J. “An Extended Set of Haar
-
like Features for Rapid
Object Detection,” Submitted to
ICIP2002
.

[3]

Bradski, G., Kaehler, A., Pisarevsky, V. "Learning
-
Based Computer Vision with
Intel's
Open Source Computer Vision Library." Intel Technology Journal. 2005.


[4]

Serre, T.
,


Wolf, L.
,


Poggio, T.
, “
Object Recognition with Features Inspired by
Visual Cortex”. Proceedings to

IEEE Computer Society Conference on

Computer
Vision and Pattern Recog
nition
, 2005.



Appendix I:

Sample results from performing car detection on real
-
time video input feed.