Tracking People from a Mobile Platform

David Beymer¹ and Kurt Konolige²

¹ IBM Almaden Research Center, 650 Harry Rd, San Jose, CA 95120, USA
beymer@us.ibm.com

² SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA
konolige@ai.sri.com

Abstract. Tracking people from a moving platform is a useful skill for the coming generation of service and human-interaction robots. It is also a challenging problem, involving sensing, interpretation, planning, and control, all in a real-time, dynamic environment. Techniques used in existing people-trackers that operate from fixed locations, noting changes against a fixed background, are not applicable. Instead, we apply a vision-based approach using real-time stereo and 3D reconstruction that explicitly models both foreground and background objects in an efficient manner. The main novelties of our approach include (1) remapping the stereo disparities to an orthographic "occupancy map", which simplifies person modeling, and (2) updating a background occupancy map based on robot motion. The current version of our system, running on a Pioneer II mobile robot, can follow people at up to 1.2 m/s in an indoor environment.

1 Introduction

As mobile robotic systems enter the workplace and consumer markets, their acceptance will be judged in part by how well they can interact with the humans surrounding them. At a basic level robots must avoid running over people, of course, but they must also deal with them as people, that is, recognize them, pay attention to them, follow them.

In this paper we concentrate on one of these basic tasks, following a single, cooperative person in an indoor environment, using passive vision sensors.¹ There is no attempt to engineer the setting, for example, to provide the person with a radio beacon or IR emitter, or to have fixed sensors in the environment. Instead, the robot must autonomously track and follow the person using its onboard vision sensors and computation.

The tracking problem is difficult because people are flexible objects, presenting a dynamic appearance, and are difficult to model geometrically; plus, the environment offers occlusions and distractions, including other people, that can confuse the tracker. Another difficulty is the real-time and dynamic nature of the task: detection and tracking must happen at reasonable frame rates to keep up with a walking person, and sudden turns near the robot require very wide field-of-view sensors, or very quick focus of attention.

¹ An even more difficult task, which we do not consider, is pursuit, where the tracked person is trying to evade the pursuer [1].

There are two general techniques in current use for people-tracking from fixed viewpoints:

1. Intrinsic characteristics. Find a property that distinguishes people from their environment, e.g., color or form ([2-4]).

2. Foreground segmentation. Subtract a static background to isolate foreground objects [5].

The most robust fixed systems use a combination of these techniques, with foreground segmentation to extract objects of interest and reduce distractors, and intrinsic characteristics to find and track people in the reduced space. But foreground segmentation is considerably more difficult from a moving platform than from a fixed viewpoint, because the background undergoes motion relative to the platform.

Our approach to tracking from a mobile platform uses a real-time stereo system to provide a continuous 3D view of the environment. The 3D information is analyzed by a Geometry Engine, which reprojects it to an orthographic floor-plan representation, and assigns and updates a background within the floor plan. A Tracking Engine models the foreground data with a mixture of Gaussian blobs, detects candidate people from the Gaussian tracks, assigns them a state that is monitored with a Kalman filter, and feeds back information to help update the background correctly.

The advantages of our system lie in the efficient use of 3D information. Reprojection and Gaussian mixtures reduce the amount of data enormously, while maintaining the essential spatial characteristics of the tracked objects. Reprojection also enables efficient background updating from a moving platform, by translating and resampling according to robot motion. Using these techniques, we can robustly track individuals moving at 1.2 m/s in a typical indoor environment, using modest embedded PC computation.

In this paper, we first discuss related work, and then we describe the overall tracking system design, as well as the mobile robot and stereo sensor systems. The next section describes the Geometry Engine algorithms, including real-time stereo and reprojection, Gaussian modeling, and background updating. Next, we describe the Tracking Engine components, and typical output and performance of the system. Finally, we discuss problems and future enhancements.

2 Related work

A large number of person tracking systems have been developed over the last several years, and the majority of them use color, background subtraction, or contour modeling [2-8]. In this section, we concentrate on systems that use stereo or Gaussian modeling, or that track from moving platforms.


The main advantages of adding stereo to a person tracking system include (1) segmentation, (2) locating objects in 3D, and (3) handling occlusion events. In [8], dense stereo is computed in real time and used to segment the scene into patches of slowly varying disparity. Background subtraction on dense stereo disparities is performed in [2, 4, 5]. Finally, [6, 7] have started fitting parametric human models to dense stereo maps, with the hope that stereo can help with self-occlusion of body parts. All of these systems assume a fixed viewpoint for calculating the background. The only extant reference on tracking from a mobile platform using stereo is [9], although there has been one TV demonstration [10]. Both of these systems used a coarse model of people to detect them in stereo range images. Once detected, people were tracked by finding the best range-image candidate near the previous location. These systems were easily fooled by distractors, in part because they did not attempt to model the background.

Gaussian functions have been used in person tracking systems for modeling multimodal distributions and articulated human models. [3, 6] model human bodies as connected 3D Gaussian blobs. In the area of multi-hypothesis tracking, [11] has used a "piecewise" Gaussian model to represent multimodality in the tracking state space, allowing the system to maintain a number of competing tracking states. This is related to Condensation tracking [12], which represents multimodality non-parametrically with a set of samples in state space. In our system, we are using Gaussian mixtures to model multiple people, with one Gaussian per person.


Fig. 1. Person-tracking system diagram. The upper dashed/shaded area shows the Geometry Engine component, which is designed to compute a foreground map containing candidate objects to track. The lower shaded area is the Tracking Engine, which detects people objects and maintains the tracked state of an object. (Diagram blocks: offline stereo calibration; stereo calculation; orthographic projection; occupancy map; estimated background; foreground map; Gaussian model; track and Kalman filter; detection; motion planning; motion control.)


3 A Mobile People-Tracking System

Fig. 1 is an overall diagram of the system, showing stereo sensor processing, occupancy map creation and updating, detection and tracking, and motion planning and control. In this paper, we concentrate on the Geometry Engine, which efficiently computes and maintains a foreground occupancy map from the 3D stereo information; and the Tracking Engine, which uses a Gaussian mixture model to detect and track people. To ensure responsiveness, we want to keep the cycle time from perception to robot motion commands at less than 100 ms (10 Hz).

3.1 Robot and Stereo Hardware

We use a small Pioneer II mobile robot from ActivMedia, with an embedded PC (400 MHz AMD K6-2) and an analog framegrabber. The stereo head is Videre Design's STH-V1, an inexpensive, low-power stereo head with analog CMOS cameras [13], mounted in a fixed position on the robot. All processing is done on the robot, which runs autonomously.

The optical parameters for the stereo head are chosen for close-range, wide-angle work. A small baseline (8 cm) makes it easier to perform stereo matching on objects less than 1 m from the robot. Wide-angle lenses give approximately a 70 degree FOV. Ideally, we would like full 360 degree coverage, because transverse motion near the robot generates large angular velocities relative to the cameras, taking the person out of the FOV.

3.2 Stereo Processing

A wide-angle lens accentuates the problem of calibrating the stereo rig, because of high lens distortion. We use the SRI Stereo Engine software [13] to calibrate the rig, using two radial distortion parameters [14]. From the calibration, the system computes an image warp for (1) removing radial distortion, and (2) rectifying the left and right images to align epipolar lines for stereo matching. The calibration also generates the projection parameters that relate stereo disparity to distance.

A small baseline and large FOV mean that the stereo range resolution degrades rapidly with distance: at 5 m it is 20 cm. Another problem is the stereo placement, on top of the robot but still only 12 inches above the floor. At a distance of 0.5 m, only a person's legs are visible.² Given these constraints, our working region is a 70 degree cone from 1 to 5 m from the robot. The control algorithms are adjusted to track a person at about 2 m from the robot, where the range accuracy is good.

² We could tilt the camera up to get a better near-field view. For these initial experiments, we chose a floor-parallel orientation to make it easier to determine the floor-plan reprojection.


Good spatial detail is important for detection and tracking, but there is a tradeoff with computational demands. The stereo processing must take place in substantially less than 100 ms, to leave time for the Geometry and Tracking Engines. A frame size of 160x120 was chosen, with a disparity search range of 16 (interpolated to 1/16 pixel); the SRI Stereo Engine algorithms take roughly 12 ms on each frame [13].

4 Geometry Engine

The primary goal of the Geometry Engine is to determine a foreground segmentation of the 3D data from stereo, and to generate candidate person hypotheses for the Tracking Engine. As mentioned in the Introduction, the Geometry Engine makes efficient use of the 3D data by reprojecting it to a floor-plan representation, called an occupancy map. Here we discuss the formation of the occupancy map and, especially, how a background model is created and maintained.

4.1 Occupancy maps

To a first approximation, standing or walking people are vertical cylinders with an elliptical cross-section. These cylinders project to ellipses on the floor plane, and are a simple and compact mathematical model of a candidate person object. Many other objects in man-made environments have a vertical orientation (e.g., walls, doors), and a floor-plane representation is a good approximation of their spatial structure, with a reduction in data size from full 3D representations.

We call the floor-plan reprojection of 3D data an occupancy map. Occupancy maps divide the X-Y plane into a set of discrete vertical buckets (Fig. 2). At location $occ(x,y)$, we accumulate all disparity pixels that land in the corresponding bucket. Given a disparity image, we can compute the corresponding occupancy map by generating the 3D point $(X, Y, Z)$ for each disparity at $(u, v, disp)$, then incrementing the value of $occ(x,y)$.

Since closer objects appear larger in the image, having each disparity pixel $(u, v, disp)$ contribute a fixed amount to $occ(x,y)$ favors closer objects. Thus, a technique is needed to compensate for the depth of a pixel. Using the stereo equation $Z = bf/disp$, where $b$ and $f$ are the baseline and focal length, one can easily show that incrementing $occ(x,y)$ by $disp_{nom}/disp$ compensates for range, where the accumulation is 1 for pixels at disparity $disp_{nom}$. An example of an occupancy map is shown in Fig. 3. We always smooth the occupancy map to compensate for image noise and stereo matching errors.
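To make the reprojection concrete, the following sketch computes an occupancy map from a disparity image. It is a minimal illustration rather than the paper's implementation: it assumes a rectified pinhole model, and the calibration parameters (fx, cx, baseline), map extents, and cell size are illustrative values, not taken from the paper.

```python
import numpy as np

# Minimal occupancy-map sketch (Sec. 4.1). Assumes a rectified pinhole model;
# fx, cx, baseline, map extents, and cell size are illustrative assumptions.
def occupancy_map(disp, fx, cx, baseline, disp_nom,
                  x_range=(-2.5, 2.5), z_range=(0.5, 5.0), cell=0.05):
    nx = int((x_range[1] - x_range[0]) / cell)
    nz = int((z_range[1] - z_range[0]) / cell)
    occ = np.zeros((nz, nx))

    v, u = np.mgrid[0:disp.shape[0], 0:disp.shape[1]]
    valid = disp > 0
    d = disp[valid]
    Z = fx * baseline / d            # stereo equation Z = b f / disp
    X = (u[valid] - cx) * Z / fx     # lateral floor-plane coordinate
    # The height coordinate is dropped: points project onto the floor plane.

    inside = (X >= x_range[0]) & (X < x_range[1]) & \
             (Z >= z_range[0]) & (Z < z_range[1])
    ix = ((X[inside] - x_range[0]) / cell).astype(int)
    iz = ((Z[inside] - z_range[0]) / cell).astype(int)
    # Range compensation: each pixel adds disp_nom/disp, so the contribution
    # is 1 at the nominal disparity and smaller for closer (larger-disp) pixels.
    np.add.at(occ, (iz, ix), disp_nom / d[inside])
    return occ   # smooth afterwards (e.g., Gaussian blur) to suppress noise
```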


Reprojection can be efficiently performed in a single step from the disparity image by table lookup, without the need to compute 3D points. For each possible $u$, $v$, and disparity $disp$, we need to store the coordinates $(x,y)$ of the occupancy map. These tables are 3D, indexed by $u$, $v$, and $disp$. Our tables are reasonable in size (2.6 MB total), however, because of the reduced image size.

The tables are generated offline once the camera calibration and working volume are known. For each $(u, v, disp)$ triplet, we reconstruct the point $(X, Y, Z)$ in 3D, and set a bit if the 3D point is inside the working volume. For interior points, we map the X and Y coordinates to the proper bucket. At run time, we can easily test whether a disparity is in the working volume and immediately map it to the occupancy map.
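A sketch of the table-lookup variant follows. Here reconstruct_3d, in_volume, and to_bucket are hypothetical stand-ins for the calibrated 3D reconstruction, the working-volume test, and the floor-plane bucket mapping, and the disparity is assumed to be quantized for indexing.

```python
import numpy as np

# Offline: for every (u, v, quantized disparity), precompute the occupancy-map
# bucket, or -1 if the reconstructed 3D point is outside the working volume.
def build_table(h, w, n_disp, reconstruct_3d, in_volume, to_bucket):
    table = np.full((h, w, n_disp), -1, dtype=np.int32)
    for v in range(h):
        for u in range(w):
            for d in range(1, n_disp):
                X, Y, Z = reconstruct_3d(u, v, d)
                if in_volume(X, Y, Z):
                    table[v, u, d] = to_bucket(X, Z)   # flat bucket index
    return table

# Run time: a single lookup per pixel replaces the 3D reconstruction.
def reproject(disp_idx, weights, table, n_buckets):
    rows = np.arange(table.shape[0])[:, None]
    cols = np.arange(table.shape[1])[None, :]
    flat = table[rows, cols, disp_idx]      # bucket per pixel, -1 if outside
    occ = np.zeros(n_buckets)
    ok = flat >= 0
    np.add.at(occ, flat[ok], weights[ok])   # weights carry disp_nom/disp
    return occ
```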

4.2 Background Updating

A background occupancy map is computed on each cycle and subtracted from the occupancy map to generate a foreground map, eliminating many of the distractions that tend to confuse a tracker. If the robot is stationary, the background is computed by taking an average over the past 10 frames, as in a fixed-viewpoint tracker [5].

Robot motion complicates the task of segmenting foreground objects, as both the foreground and background objects are in motion. Thus, a static background image cannot be used. Also, person motion and robot motion are confounded, as the motion seen by the camera is a combination of the two. Still, using odometry feedback from the robot, we can predict an expected background map at frame N from the background map at frame N-1, as follows (see Fig. 3).



Fig. 2. Floor-plan reprojection to an occupancy map. A disparity image from stereo generates a point cloud, which is projected vertically onto the floor plane.


At the end of processing frame N-1, we compute the background occupancy map

$$B_{N-1}(x,y) = occ_{N-1}(x,y) - G_{N-1}(x,y)$$

where $occ_{N-1}(x,y)$ is the occupancy map generated by projecting all stereo pixels onto the floor plane, and $G_{N-1}(x,y)$ is the Gaussian representing the person's occupancy (see the next section). By subtracting out the occupancy belonging to the person, we are left with the background visible in frame N-1. Next, in frame N, given the robot motion in the floor plane, we can predict the background occupancy

$$B_N^{predict}(x,y) = B_{N-1}(T(x,y))$$

where $T(x,y)$ is a similarity transform in the floor plane that compensates for the known robot motion from frame N-1 to frame N. This transforms all of the visible stereo data in frame N-1 to frame N, and provides a good guess of a background image. Finally, the foreground occupancy map is computed as

$$F_N(x,y) = occ_N(x,y) - B_N^{predict}(x,y)$$

where $F_N(x,y)$ ideally contains disparity data only from the foreground object being tracked. In practice, it will also contain background data that is just becoming visible in frame N. But this extraneous data should not distract the tracker as long as this new background data is spatially separated from the tracked person (in the floor plane).
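The update can be sketched as follows. The similarity transform T is expressed here directly in map-cell units (rotation dtheta, translation (tr, tc)), an assumed parameterization of the robot's planar odometry, and bilinear resampling stands in for whatever interpolation the original system used.

```python
import numpy as np
from scipy.ndimage import affine_transform

# Sketch of the Sec. 4.2 background update; T is given in map-cell units.
def predict_background(B_prev, dtheta, tr, tc):
    c, s = np.cos(dtheta), np.sin(dtheta)
    R = np.array([[c, -s], [s, c]])
    # affine_transform samples out[p] = in[R @ p + offset], which realizes
    # B_N^predict(x, y) = B_{N-1}(T(x, y)) with bilinear interpolation.
    return affine_transform(B_prev, R, offset=(tr, tc), order=1, cval=0.0)

def geometry_engine_step(occ_N, B_prev, dtheta, tr, tc):
    B_pred = predict_background(B_prev, dtheta, tr, tc)
    F_N = occ_N - B_pred        # foreground map F_N = occ_N - B_N^predict
    return F_N, B_pred

# End of frame N: remove the tracked person's Gaussian occupancy so that
# only background remains, B_N = occ_N - G_N.
def end_of_frame(occ_N, G_N):
    return occ_N - G_N
```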


Fig. 3. Background and foreground. The input image and stereo disparity are on the left. The top middle shows the occupancy map computed by reprojection. At top right is the background image, formed by translating the previous frame's background. The middle bottom is the Gaussian blob of the tracked person, while the lower right is the occupancy residue after background and person have been accounted for.


5 Tracking Engine

The Tracking Engine determines candidate objects to track from the foreground map, by modeling people as Gaussian blobs in a mixture model. It also implements a decision procedure for tracking, and updates a Kalman filter model of the tracked person. We first describe the Gaussian model, then the decision-making process.

5.1 Gaussian mixture models

Since people map to Gaussian blobs in the occupancy map, we have explored using a Gaussian mixture model to approximate the occupancy map $occ(x,y)$. Each individual being tracked is represented by a single Gaussian in the mixture. Two mixture models have been explored for representing occupancy. Let the mixture model $G$ be composed of $n$ Gaussians $G_i$, where

$$G_i(x, y;\, x_i, y_i, \sigma_i, s_i) = s_i\, e^{-[(x - x_i)^2 + (y - y_i)^2]/2\sigma_i^2}.$$

For the sake of brevity we will often omit the parameters $x_i$, $y_i$, $\sigma_i$, and $s_i$, and write $G_i$ as $G_i(x,y)$. We have explored:

1. Additive mixture model: $G(x,y) = \sum_{i=1}^{n} G_i(x,y)$

2. Competitive mixture model: $G(x,y) = \max_{i=1}^{n} G_i(x,y)$

The main difference between the models is that the competitive model explicitly incorporates segmentation of the occupancy map. That is, a Gaussian $G_i$ (i.e., person $i$ being tracked) claims the pixel set

$$\{(x,y) \mid G_i(x,y) \ge G_j(x,y),\ \forall j \ne i\}.$$

Fig. 4. The measurement process in the tracker approximates occ(x,y) with a mixture of Gaussians G(x,y). (Panels: occ(x,y); G(x,y); the residue |occ(x,y) - G(x,y)|.)
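As a small illustration, the two mixture models can be evaluated on the occupancy grid as follows. The function gaussian_blob and the parameter tuples (x_i, y_i, sigma_i, s_i) follow the definitions above; the grid representation itself is an assumption.

```python
import numpy as np

# The per-person Gaussian blob G_i(x, y; x_i, y_i, sigma_i, s_i).
def gaussian_blob(xx, yy, xi, yi, sigma, s):
    return s * np.exp(-((xx - xi)**2 + (yy - yi)**2) / (2 * sigma**2))

# Additive model: G(x,y) = sum_i G_i(x,y).
def additive_mixture(xx, yy, params):
    return sum(gaussian_blob(xx, yy, *p) for p in params)

# Competitive model: G(x,y) = max_i G_i(x,y); the argmax is the
# segmentation, i.e., which Gaussian claims each cell.
def competitive_mixture(xx, yy, params):
    blobs = np.stack([gaussian_blob(xx, yy, *p) for p in params])
    return blobs.max(axis=0), blobs.argmax(axis=0)
```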


The additive model, on the other hand, is more liberal and allows multiple Gaussians to explain occupancy at a given $(x,y)$. We found that when two people are close to one another, this additive mixing reduced the Gaussians' ability to lock onto individual modes of $occ(x,y)$, although the overall approximation was good. The competitive model was better able to track the modes of $occ(x,y)$ for groups of people.

We fit a competitive mixture model to the occupancy map by iteratively performing segmentation and parameter update steps. For $n$ people, the measurement process is:

Iterate until the Gaussian parameters settle:
  For each person $i$:
    segment: $mask_i(x,y) = \{(x,y) \mid G_i(x,y) \ge G_j(x,y),\ \forall j \ne i\}$
    update $G_i$: minimize $\sum_{mask_i} (occ(x,y) - G_i(x,y))^2$ using one step of Newton-Raphson iteration.
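A hedged sketch of this loop, reusing gaussian_blob from the previous listing: the segmentation step is the competitive argmax, while the update step here uses weighted moment matching as a simple stand-in for the paper's single Newton-Raphson step on the squared residual.

```python
import numpy as np

# Iterative segment/update fit of the competitive mixture (Sec. 5.1).
# params is a list of [x_i, y_i, sigma_i, s_i]; xx, yy, occ are 2D grids.
def fit_competitive(occ, xx, yy, params, n_iters=5):
    for _ in range(n_iters):
        blobs = np.stack([gaussian_blob(xx, yy, *p) for p in params])
        owner = blobs.argmax(axis=0)       # mask_i: cells claimed by G_i
        for i, p in enumerate(params):
            w = occ[owner == i]
            if w.sum() <= 0:
                continue                   # this Gaussian claimed nothing
            xm, ym = xx[owner == i], yy[owner == i]
            # Moment-matching update (stand-in for one Newton-Raphson step)
            p[0] = np.average(xm, weights=w)
            p[1] = np.average(ym, weights=w)
            r2 = (xm - p[0])**2 + (ym - p[1])**2
            p[2] = np.sqrt(np.average(r2, weights=w) / 2) + 1e-6
            p[3] = w.max()                 # amplitude ~ peak occupancy
    return params
```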


An example of this fitting process is shown in Fig. 4. In this scheme, the system detects new people by looking for local maxima in $occ(x,y) - G(x,y)$ that are above a threshold. That is, it attempts to model any unexplained residual in the foreground occupancy map as a new person object (see Fig. 3, lower right).

5.2 Detection and tracking

The decision method implemented by the Tracking Engine is outlined in Fig. 5. On startup, it waits and acquires a background occupancy map, then starts looking for Gaussian blobs in the occupancy foreground map. Each such blob is assumed to be a person-type object, and the Tracking Engine tracks its movement in the occupancy map, assigning it a state vector. The main part of a person's state vector is the vector $(x_i, y_i, \sigma_i, s_i)$ parameterizing the Gaussian. In addition, the person's position $(x_i, y_i)$ is tracked with a Kalman filter; the Kalman filter maintains position and velocity $(x_i, y_i, \dot{x}_i, \dot{y}_i)$ and assumes a simple constant-velocity motion model.

The Tracking Engine follows a simple Kalman filter framework, with the Gaussian mixture modeling acting as the "measurement" process. For each person it predicts the expected location $(x_i, y_i)$ in the occupancy map; this location is used as the initial condition for the Gaussian fitting process. Next, it updates the Gaussian locations based on the foreground map. Finally, it performs a Kalman filter update on $(x_i, y_i, \dot{x}_i, \dot{y}_i)$.
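A minimal constant-velocity Kalman step for one person might look like this; the time step dt and the noise covariances Q and R are tuning assumptions not specified in the paper.

```python
import numpy as np

# One predict/update cycle of the per-person Kalman filter over
# (x, y, xdot, ydot); z is the fitted Gaussian's center (x_i, y_i).
def kf_step(state, P, z, dt, q=0.1, r=0.05):
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], float)   # constant-velocity motion model
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], float)    # only position is measured
    Q, R = q * np.eye(4), r * np.eye(2)

    # Predict: the predicted (x, y) seeds the Gaussian fit in the new frame.
    state = F @ state
    P = F @ P @ F.T + Q

    # Update with the measured blob center.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    state = state + K @ (z - H @ state)
    P = (np.eye(4) - K @ H) @ P
    return state, P
```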


In Detect and Track mode, the Tracking Engine examines all object tracks. The presence of many tracks causes it to go into the background update state, where it reacquires a stable background. If there is just a single track, then the Tracking Engine assumes that is the person to track, and goes into the Motion and Track state. Here it "locks on" to the single track, and uses the background estimation technique described in Section 4.2 to continuously update the background while moving.
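The decision procedure of Fig. 5 can be summarized as a small state machine. This sketch paraphrases the figure's transition labels; the destination of the lost-track transition is an assumption.

```python
from enum import Enum, auto

# States of the Fig. 5 finite state machine for the person tracker.
class Mode(Enum):
    BACKGROUND_COMPUTATION = auto()   # loop ~1 sec until background finished
    DETECT_AND_TRACK = auto()
    MOTION_AND_TRACK = auto()

def next_mode(mode, background_done, n_stable_tracks, lost_track):
    if mode is Mode.BACKGROUND_COMPUTATION:
        return Mode.DETECT_AND_TRACK if background_done else mode
    if mode is Mode.DETECT_AND_TRACK:
        if n_stable_tracks == 1:
            return Mode.MOTION_AND_TRACK        # lock on to the single track
        if n_stable_tracks > 1:
            return Mode.BACKGROUND_COMPUTATION  # reacquire a stable background
        return mode                             # no tracks: keep detecting
    # MOTION_AND_TRACK: keep following while still tracking; on a lost
    # track, returning to background computation is assumed here.
    return Mode.BACKGROUND_COMPUTATION if lost_track else mode
```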

6 Results and Conclusion

We have implemented the full system, but have not yet done systematic experimentation, so we can present only anecdotal results at this point. The whole system has a cycle time of under 100 ms in all states. With simple motion control, the robot will follow a person at up to 1.2 m/s, a fast walk. Outdoors, with few distractors, we have tracked people for over 15 minutes at a time without an error. The biggest problem is fast transverse motion near the robot: a better FOV, perhaps with multiple stereo devices, would help here. Indoors, we can reliably track around an office environment, even through doorways and offices with clutter. At times the robot will fixate on a non-person object; for example, it tends to like the larger ficus trees in our office. These errors show that background updating is not completely reliable in removing distractors, and we must supplement the Geometry Engine with other methods for tracking, for example, image template correlation.

Fig. 5. Decision procedure for the Tracking Engine: a finite state machine for the person tracker with states Background Computation (looping for ~1 sec until the background is finished), Detect and Track (no tracks: keep detecting; multiple stable tracks: recompute the background), and Motion and Track (entered on a single stable track, exited on a lost track).


Finally, simple motion control is not sufficient, since the robot must also avoid obstacles and pursue the person when occluded in turning a corner. We have tried using the recent gradient method [15] for following while avoiding obstacles, but have found that, with a fixed camera mount, we cannot turn to avoid obstacles and still track the person. Either a panning camera or a wider FOV is necessary.

The main contributions of this paper are the introduction of the occupancy map, background updating while moving, and the use of Gaussian mixture models for tracking people.

References

[1] LaValle, S., D. Lin, L.J. Guibas, J.C. Latombe, and R. Motwani. Finding an Unpredictable Target in a Workspace with Obstacles. Proc. IEEE ICRA, 1997.
[2] Haritaoglu, I., D. Harwood, and L.S. Davis. W4S: A Real-Time System for Detecting and Tracking People in 2 1/2 D. European Conference on Computer Vision, 1998, Freiburg, Germany.
[3] Wren, C., et al. Pfinder: Real-Time Tracking of the Human Body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997. 19(7).
[4] Beymer, D. and K. Konolige. Real-Time Tracking of Multiple People Using Stereo. IEEE Frame Rate Workshop, 1999, Corfu, Greece.
[5] Eveland, C., K. Konolige, and R.C. Bolles. Background Modeling for Segmentation of Video-Rate Stereo Sequences. CVPR, 1998, Santa Barbara, CA.
[6] Jojic, N., M. Turk, and T.S. Huang. Tracking Self-Occluding Articulated Objects in Dense Disparity Maps. ICCV, 1999, Kerkyra, Greece.
[7] Lin, M.H. Tracking Articulated Objects in Real-Time Range Image Sequences. ICCV, 1999, Kerkyra, Greece.
[8] Darrell, T., G. Gordon, and M. Harville. Integrated Person Tracking Using Stereo, Color, and Pattern Detection. CVPR, 1998, Santa Barbara, California.
[9] Huber, E. and D. Kortenkamp. Using Stereo Vision to Pursue Moving Agents with a Mobile Robot. ICRA, 1995.
[10] Robots Alive! Scientific American Frontiers television show on SRI's FLAKEY robot, October 1996.
[11] Cham, T.-J. and J.M. Rehg. A Multiple Hypothesis Approach to Figure Tracking. IEEE Conference on Computer Vision and Pattern Recognition, 1999, Fort Collins, CO, pp. 239-245.
[12] Isard, M. and A. Blake. Contour Tracking by Stochastic Propagation of Conditional Density. European Conference on Computer Vision, 1996.
[13] Konolige, K. Small Vision Systems: Hardware and Implementation. Eighth International Symposium on Robotics Research, 1997, Hayama, Japan.
[14] Tsai, R.Y. A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses. IEEE Journal of Robotics and Automation, 1987. RA-3(4).
[15] Konolige, K. A Gradient Method for Realtime Robot Control. IROS, 2000.