Automatic Visual Surveillance of Vehicles and People
P. Remagnino, S. Maybank, R. Fraile, K. Baker
Department of Computer Science, The University of Reading, Reading, RG6 6AY, UK
R. Morris
School of Computer Studies, University of Leeds, Leeds, LS2 9JT, UK
Key words:
visual surveillance, scene interpretation, learning.
Abstract:
This paper presents three separate techniques to interpret scene dynamics. Vehicle trajectories, pedestrian behaviours and the interactions between vehicles and pedestrians are analysed using probabilistic frameworks. The input is provided by an integrated vision system which tracks vehicles and pedestrians in complex scenarios.
1. INTRODUCTION
The last decade has seen a large increase in the use of visual surveillance cameras. These are often installed in concourses, car park areas and high security sites to monitor the flow of pedestrians and vehicles for security and data analysis. The job of monitoring image sequences is usually assigned to a human operator who waits for important events to occur. Operators rapidly become bored and lose concentration. It is therefore essential to devise autonomous surveillance systems which can interpret the images and alert a human operator only when suspicious events occur.
This paper describes three techniques which interpret the output of an integrated vision system built for locating and tracking pedestrians and vehicles in complex scenes. Individual trajectories of people and vehicles are analysed and interpreted. Vehicle trajectories are analysed and predicted using hidden Markov models. Human trajectories are classified as typical or atypical using a supervised learning technique. The vehicle and human trajectories are also interpreted by Bayesian networks, and verbal descriptions of their motion dynamics and their trajectory trends are produced. These descriptions allow a human observer to interpret the scene quickly.
The vision system is thus a powerful tool for interpreting relevant scene events and identifying anomalous situations.
2. PREVIOUS WORK
Over the last ten years vision researchers have provided a number of solutions to the visual surveillance problem. Nagel's paper [2] on semantic interpretation is a seminal work on the subject. Buxton and Howarth [5] and Buxton and Gong [4] introduced Bayesian networks, based on the work of Pearl [1], to detect the foci of attention of a dynamic scene and to provide interpretations of traffic situations. Huang et al. [3] proposed a probabilistic approach, similar to [5], for the automatic visual surveillance of Californian highways. Bogaert et al. [9] have worked on the surveillance of subways and metropolitan stations to monitor vandalism. More recently Haag and Nagel [10] proposed a temporal fuzzy logic to provide high level verbal descriptions of traffic scenes. Bobick [6] proposed a novel approach for the description of human activities. Brand et al. [11][12] described a coupled hidden Markov model to interpret human interactions and perform visual surveillance.
The vision system presented here is unique in the literature in that it builds and automatically updates a 3D model of the imaged dynamic scene. Vehicle trajectories are analysed by a novel approach which learns and predicts the global trends in terms of local curve measurements. Pedestrian behaviour is interpreted by a supervised learning technique which is complementary to that of Johnson and Hogg [16]. Object dynamics and object interactions are analysed by Bayesian networks using a variation of the approach of Buxton and Howarth [5]. Overall the system offers an integrated suite of visual modules. It is richer than any other in the literature and it can automatically infer interpretations from a sequence of images.
3. THE INTEGRATED SYSTEM
The vision system integrates two independently developed software modules, one for tracking vehicles [17] and the other for tracking pedestrians [7]. The system assumes a pre-calibration of a static camera which yields a global coordinate system on the ground plane. Regions of relevant motion are detected and assigned to a software module according to their elongation (horizontally elongated shapes go to the vehicle tracker, vertically elongated ones to the pedestrian tracker). The vehicle tracker instantiates 3D wire-frame models on the ground plane, while the pedestrian tracker makes use of deformable two-dimensional shapes in the image. These shapes are back-projected onto the 3D world assuming a standard person height, and a cylinder is instantiated for each pedestrian as part of the 3D scene model. The 3D model is kept up to date by both systems. Occlusions are handled by making use of the scene geometry and the position of the camera. A 2D depth map is used by the image processing routines to deal with occlusions [7].
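The elongation-based routing rule can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the tie-breaking choice when width equals height are assumptions.

```python
def route_region(width: float, height: float) -> str:
    """Assign a detected motion region to a tracker by its elongation:
    horizontally elongated regions go to the vehicle tracker, vertically
    elongated ones to the pedestrian tracker (ties default to pedestrian)."""
    return "vehicle_tracker" if width > height else "pedestrian_tracker"

print(route_region(80, 30))  # wide blob: vehicle
print(route_region(20, 60))  # tall blob: pedestrian
```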
4. DESCRIPTION OF VEHICLE TRAJECTORIES
The vehicle tracker obtains from each image frame a measurement x_i of the position of the vehicle on the ground plane. Here x_i is a measurement of the ground plane position directly below the mid-point of the rear axle. A typical image sequence of a moving vehicle yields 30 to 40 measurements, with one measurement taken every fourth frame. The time between consecutive measurements is 0.16 s.
The sequence of measurements is divided into overlapping segments, each containing 10 measurements, with adjacent segments having 9 measurements in common. The measurements in each segment are approximated by a continuous low curvature function. The low curvature ensures that the steering angles associated with the continuous approximation are physically realistic; steering angles are usually low, even at full lock. Let the measurements in the segment of interest be x_1, ..., x_10, and let t_i be the time of the frame from which x_i is obtained. The approximating function f is found in two steps. The first step is to find the degree two polynomial function p which minimises

    sum_{i=1}^{10} |x_i - p(t_i)|^2                                   (1)

The second step is to obtain a low curvature approximation f to p by minimising the functional

    E(f) = integral |f(t) - p(t)|^2 dt + lambda integral |f''(t)|^2 dt    (2)

where lambda is an experimentally determined constant. The first term on the right hand side of (2) ensures that f is close to p and the second term ensures that f has a low curvature.
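The two fitting steps can be sketched numerically. This is an illustrative discretisation under stated assumptions, not the authors' implementation: one ground-plane coordinate is shown, the curvature penalty is approximated by second differences (a proxy for |f''|^2), and the value of lambda is arbitrary.

```python
import numpy as np

# t: frame times (0.16 s apart, as in the text); x: noisy measurements
# of one ground-plane coordinate (synthetic data for illustration).
t = np.linspace(0.0, 9 * 0.16, 10)
x = 2.0 * t + 0.1 * t**2 + np.random.default_rng(0).normal(0, 0.05, 10)

# Step 1: degree-two polynomial p minimising the sum of squared residuals (1).
p = np.polyval(np.polyfit(t, x, 2), t)

# Step 2: discrete analogue of functional (2): minimise
#   ||f - p||^2 + lam * ||D2 f||^2,
# where D2 is a second-difference matrix standing in for the curvature term.
lam = 1.0  # illustrative; the paper's constant was determined experimentally
D2 = np.diff(np.eye(len(t)), n=2, axis=0)
f = np.linalg.solve(np.eye(len(t)) + lam * D2.T @ D2, p)
```

By construction the minimiser f can never have a larger second-difference norm than p itself, which is the sense in which the second term enforces low curvature.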
The least speed and the greatest steering angle of the car in the time interval are estimated from f, and the segment is assigned to one of the four classes a, l, r, s (ahead, left, right, stop) using Table 1.

Table 1. Relation between classes and conditions

    Class   Condition
    a       speed above the stop threshold, steering angle below the turn threshold
    l       speed above the stop threshold, steering angle exceeds the turn threshold to the left
    r       speed above the stop threshold, steering angle exceeds the turn threshold to the right
    s       least speed below the stop threshold
The trajectory is thus reduced to a string of symbols drawn from {a, l, r, s}, one symbol for each segment.
The string is edited using a hidden Markov model for the motion of the car. The model has internal states A, L, R, S, again corresponding to ahead, left, right, stop. The states A, L, R, S are regarded as the true states of the car, while a, l, r, s are observations which may be in error. The Viterbi algorithm [15] is used to find the string of states for which the observed string has the highest probability. The transition probabilities for the HMM were assigned `by hand', after the analysis of a set of 21 image sequences of a car manoeuvring in a car park.
Figure 1. The trajectory
Figure 1 shows a trajectory from one of the experiments [14]. The vehicle moves from top left to bottom right. The arrows represent the normal to the trajectory, pointing to the right hand side of the driver. The base points of the arrows are the coordinates of the vehicle at each sample. The sequence of extracted symbols is {a, a, a, a, a, a, a, a, l, l, l, l, l, a, a, a, l, l, a, a, l, l, l} and the most likely sequence of states, as identified by the Viterbi algorithm, is {A, A, A, A, A, A, A, A, L, L, L, L, L, L, L, L, L, L, L, L, L, L, L}.
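The editing step can be illustrated with a generic Viterbi decoder over the four states. The transition and emission probabilities below are invented for illustration (the paper's were assigned by hand), so the decoded strings are not the paper's results; the sketch only shows how isolated mis-observed symbols get smoothed away.

```python
import math

STATES = "ALRS"  # true states; observations are the lowercase symbols

def viterbi(obs: str, p_stay: float = 0.94, p_hit: float = 0.85) -> str:
    """Most likely state string for an observed symbol string over {a,l,r,s}."""
    p_move = (1 - p_stay) / (len(STATES) - 1)  # off-diagonal transition prob.
    p_miss = (1 - p_hit) / (len(STATES) - 1)   # prob. of each wrong symbol

    def emit(s: str, o: str) -> float:
        return math.log(p_hit if s.lower() == o else p_miss)

    # Initialise with a uniform prior over states.
    scores = {s: math.log(1 / len(STATES)) + emit(s, obs[0]) for s in STATES}
    paths = {s: s for s in STATES}
    for o in obs[1:]:
        new_scores, new_paths = {}, {}
        for s in STATES:
            best = max(STATES,
                       key=lambda r: scores[r] + math.log(p_stay if r == s else p_move))
            new_scores[s] = (scores[best]
                             + math.log(p_stay if best == s else p_move)
                             + emit(s, o))
            new_paths[s] = paths[best] + s
        scores, paths = new_scores, new_paths
    return paths[max(scores, key=scores.get)]
```

With these illustrative probabilities a single stray symbol inside a run is corrected, e.g. `viterbi("aalaa")` yields `"AAAAA"`, while a sustained run of new symbols is accepted as a genuine state change.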
5. ANALYSIS OF PEDESTRIAN BEHAVIOUR
Pedestrian behaviour is analysed by building a statistical model of the trajectory. The aim here is to obtain high level descriptions of the behaviour of the entire trajectory, classifying each trajectory as typical or atypical. In particular we are interested in the behaviour which occurs in car-park situations. The model describes the instantaneous behaviour of the person relative to an individual vehicle, and the combinations of interactions with several vehicles. A few points on the trajectory are chosen as salient features.
A principal design goal has been to construct a system which can describe a wide variety of different behaviours. The geometry of the scene changes over time as cars leave and enter the car park. People can take many possible routes through the car park, weaving between vehicles. Individual trajectories will differ because a person is likely to start or finish the journey at his or her particular car. This precludes the use of techniques which compare a whole path with paths which have occurred many times before [16].
On each trajectory of a person, the points which are closest to each vehicle in the scene are selected as landmarks. Landmarks can be computed in different ways:
1. For each vehicle the closest point on the trajectory to that object is found. This yields one landmark for each object.
2. All local minima of the functions d_i(t) are used, where d_i(t) is the distance to vehicle i at time t.
3. Methods 1 and 2 are combined, that is, the global minima of the distances to each vehicle are found, rejecting those for which some other object is closer.
In method 1 some of the minima will correspond to vehicles which are far away and of little interest, but which still affect the characteristics of later distributions, in particular making them sensitive to the number of cars in the scene. In the second method measurement noise is a major problem. If the speed is low there may be several local minima for each object. Some of these minima can be eliminated by smoothing the curve, but this is still likely to leave more than one minimum per object. This may be useful for capturing the length of time two objects are close or for detecting repeated interactions. The third method was chosen because it combines the good features of the first two.
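The chosen method 3 can be sketched as follows. The function name and data layout (trajectory as a list of 2D ground-plane points, vehicles as fixed points) are assumptions for illustration.

```python
import math

def landmarks(trajectory, vehicles):
    """Method 3: for each vehicle, take the trajectory point closest to it,
    then reject the landmark if some other vehicle is closer to the person
    at that instant. Returns (point index, distance) pairs."""
    marks = []
    for v in vehicles:
        i = min(range(len(trajectory)),
                key=lambda k: math.dist(trajectory[k], v))
        d = math.dist(trajectory[i], v)
        # Keep the landmark only if no other object is nearer at time i.
        if all(math.dist(trajectory[i], w) >= d for w in vehicles):
            marks.append((i, d))
    return marks
```

For example, with a straight trajectory past two parked vehicles, the landmark for a distant vehicle is rejected whenever the nearer vehicle dominates it.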
The identified landmarks are assigned values which correspond to the speed of the person and the distance between the person and the vehicle. In current work, principal component analysis is being used to obtain further characterisations of the local shape of the trajectory.
The landmark values are used to build a statistical distribution. This is achieved by taking all the landmarks on all the trajectories of a training set, and calculating their speeds and distances.
These two quantities have definite orderings, i.e. low speed is more noteworthy than high speed, and low distance is more noteworthy than high distance. So the probability for a landmark with speed v and distance d is calculated by simply counting the number of points with speed at most v and distance at most d, and dividing by the total number of points.
For each trajectory the above procedure gives a sequence of landmarks, and hence an ordered sequence of probabilities. This sequence is sorted in order of increasing probability. This makes the representation independent of the particular order in which the events occur.
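The counting procedure can be sketched as below. The "at most" direction of the comparison is an assumption consistent with the stated ordering (low speed and low distance are the noteworthy, hence low-probability, cases); the toy training data are invented.

```python
def landmark_probability(speed, dist, training):
    """Empirical probability of a landmark: the fraction of training
    landmarks with speed <= speed and distance <= dist, so that low
    (noteworthy) values receive low probabilities."""
    n = sum(1 for s, d in training if s <= speed and d <= dist)
    return n / len(training)

# Toy training set of (speed, distance) landmark values.
training = [(1.0, 2.0), (1.5, 3.0), (0.2, 0.5), (2.0, 4.0)]

# Per-trajectory representation: the probabilities sorted in increasing
# order, which discards the order in which the events occurred.
probs = sorted(landmark_probability(s, d, training) for s, d in training)
```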
Figure 2. (a) typical and (b) atypical trajectories
Figure 2(a) shows two typical trajectories, and the associated sequences of sorted probabilities [13]. Here the first two values are low, corresponding to the person stopping near their own car. The other values in the sequence rapidly increase. Figure 2(b) shows atypical trajectories. Here there are more values with low probabilities.
A supervised learning technique is used to classify the sorted sequences. Data are divided into two groups, a training set and a test set. Each set is further classified by hand into typical and atypical trajectories. The training set consisted of 59 trajectories: 54 typical and 5 atypical. The test set consisted of 70 trajectories: 64 typical and 6 atypical. The weighted sum w_1 p_1 + ... + w_5 p_5 of the first five probabilities p_1, ..., p_5 in each sequence is used. If the sum is greater than 0.5 the trajectory is classified as being atypical. In the training stage an exhaustive search is made over all possible weights, each weight taking values between 0 and 1 with a step increment of 0.2. Those weights which correctly classify the most trajectories (four or fewer misclassifications) are chosen and the mean values of these weights are computed. Trajectories in the test set are finally classified using the calculated weights. In the experiments all six atypical trajectories were correctly classified, as were 60 of the 64 typical trajectories.
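The classifier and the exhaustive weight search can be sketched as follows. The decision rule and the 0.2 grid follow the text; the function names, the toy dataset in the test, and the exact acceptance criterion (at most four misclassifications) applied to that toy data are illustrative assumptions.

```python
from itertools import product

def classify(probs, weights, threshold=0.5):
    """Weighted sum of the first five sorted probabilities; a trajectory is
    labelled atypical when the sum exceeds the threshold, as in the text."""
    score = sum(w * p for w, p in zip(weights, probs[:5]))
    return "atypical" if score > threshold else "typical"

def train(dataset, max_errors=4):
    """Exhaustive search over weights in {0, 0.2, ..., 1}^5.  The weight
    vectors misclassifying at most `max_errors` trajectories are kept and
    averaged component-wise to give the final weights."""
    grid = [i * 0.2 for i in range(6)]
    good = [w for w in product(grid, repeat=5)
            if sum(classify(p, w) != label for p, label in dataset) <= max_errors]
    if not good:
        return None
    return [sum(w[i] for w in good) / len(good) for i in range(5)]
```

An exhaustive search is feasible here because the grid has only 6^5 = 7776 candidate weight vectors.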
6. INTERPRETATION OF OBJECT INTERACTIONS
Each pedestrian and vehicle is assigned a probabilistic agent, called a behaviour agent, which is capable of interpreting its behaviour in terms of a description of motion dynamics and trajectory trends (regularity). When two objects are in close proximity another probabilistic agent, called a situation agent, interprets the interaction.
In the current incarnation [8], an agent is a Bayesian network with a semantic annotator for interpreting the output of the net. A Bayesian network is a directed acyclic graph, in which the nodes represent clauses or events and the arcs their causal relations. The model was created and developed by Pearl [1]. Conceptually the Bayesian network captures the qualitative and quantitative nature of the underlying problem in a single and compact model. The graph topology is a qualitative representation of the causal relationships. The model infers the most likely ‘explanation’ of the observations by propagating evidence from the leaves to the root.
The Bayesian network used by the behaviour agent is shown in Figure 3. The two hidden nodes (DYN and TRAJ) identify an intermediate interpretation of the object behaviour in terms of its dynamics (DYN) and its trajectory (TRAJ). The DYN node tells whether the object is stationary, or moving slowly, with average speed or fast in a particular area of interest, or moving out of it. The links between location (LOC), heading (HD), speed (SPEED) and the dynamics (DYN) define a set of fixed causal links. Each link carries a conditional probability matrix which encodes a priori knowledge about the causal relationship. For instance the object trajectory (TRAJ) is affected by both its acceleration (ACC) and curvature (CURV). While curvature (CURV) simply encodes the regularity of the stretch, the acceleration (ACC) node records whether the object is accelerating, decelerating or travelling at constant speed. The root node represents the behaviour (BEH) or attitude of the object. In essence the characteristics described by the DYN and the TRAJ nodes are merged into a more compact and meaningful description. So, for instance, if the object is a pedestrian and it is moving slowly in a field with a regular trajectory, this will be interpreted as the pedestrian walking on the field. The behaviour agent produces the most probable textual description based on the object class and its behaviour probability vector.
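The evidence-propagation idea can be illustrated with a toy network in the spirit of the behaviour agent: observed evidence nodes feed hidden DYN- and TRAJ-like nodes, which feed a root BEH node. All node values and conditional probabilities below are invented, and the real network has more nodes (LOC, HD, SPEED, ACC, CURV); this only shows the marginalisation mechanics.

```python
# Hypothetical conditional probability tables (illustrative values only).
p_dyn = {"slow": {"stationary": 0.2, "moving": 0.8},
         "fast": {"stationary": 0.05, "moving": 0.95}}
p_traj = {"regular": {"smooth": 0.9, "erratic": 0.1},
          "irregular": {"smooth": 0.2, "erratic": 0.8}}
p_beh = {("moving", "smooth"): {"walking": 0.9, "loitering": 0.1},
         ("moving", "erratic"): {"walking": 0.4, "loitering": 0.6},
         ("stationary", "smooth"): {"walking": 0.1, "loitering": 0.9},
         ("stationary", "erratic"): {"walking": 0.05, "loitering": 0.95}}

def behaviour_posterior(speed, curv):
    """Posterior over the root BEH node given observed evidence, obtained
    by marginalising the hidden DYN and TRAJ nodes."""
    post = {"walking": 0.0, "loitering": 0.0}
    for dyn, pd in p_dyn[speed].items():
        for traj, pt in p_traj[curv].items():
            for beh, pb in p_beh[(dyn, traj)].items():
                post[beh] += pd * pt * pb
    return post
```

A semantic annotator would then turn the most probable entry of this vector, together with the object class, into a textual description such as "the pedestrian is walking".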
Figure 3. The behaviour agent
Figure 4. The situation agent
The situation agent creates a probabilistic connection between two behaviour agents assigned to two objects in close proximity. Figure 4 shows the network. The situation agent summarises the occurring events in terms of the behaviours of the two objects involved (BEH1 and BEH2) and their directions of motion (DIRs). The behaviour nodes BEH1 and BEH2 represented in Figure 4 are the root nodes of the behaviour agents. The DIRs node records whether the two objects are heading towards one another, moving away from one another, or moving in non-interfering directions.
The present implementation only deals with pairwise interactions. The authors are currently working on an extension to the system. The idea is to model complex and prolonged situations with many interactions using a Markov model. Models will be learnt off-line using training sets of similar situations, and used on-line to interpret a specific situation.
Figure 5 and Figure 6 show two frames of a car park sequence taken with a fixed camera. The figures show the 3D scene model superimposed onto the image (wire-frames for vehicles, and cylinders for pedestrians). Behaviour agents are assigned to all identified objects in the scene. Agent interpretations appear as text written below each image. Figure 5 displays the behaviour interpretations for vehicles VEH1 and VEH2 and for pedestrian PED1.
Figure 5. Frame 487
Figure 6. Frame 539
Figure 6 shows all behaviour interpretations and a situation generated by the close proximity of vehicle VEH2 and pedestrian PED1. Pedestrian PED2 seems close to vehicle VEH1, but this is partly an effect of perspective. The pedestrian is not close enough to trigger the creation of a situation agent (a Euclidean threshold was set to 4 metres).
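The trigger condition for creating a situation agent reduces to a ground-plane distance test; a minimal sketch follows, where the 4-metre value is from the text but the function name is an assumption.

```python
import math

PROXIMITY_THRESHOLD = 4.0  # metres, the Euclidean threshold from the text

def spawn_situation_agent(pos_a, pos_b):
    """A situation agent is created only when two tracked objects come
    within the ground-plane Euclidean threshold of one another."""
    return math.dist(pos_a, pos_b) < PROXIMITY_THRESHOLD

print(spawn_situation_agent((0.0, 0.0), (3.0, 0.0)))  # True: 3 m apart
```

Because the test uses ground-plane coordinates from the calibrated 3D model rather than image coordinates, apparent closeness caused by perspective (as with PED2 and VEH1 above) does not trigger an agent.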
7. CONCLUSIONS
We have presented an integrated system for use in visual surveillance problems. The system uses the output of a vision system to analyse vehicle trajectories using hidden Markov models, to learn the atypical behaviours of pedestrians with a supervised technique, and to interpret the interactions between pedestrians and vehicles using a Bayesian formalism. A brief account of some experimental results was provided. The authors are currently working on an extension of the system to learn more complex situations.
REFERENCES
[1] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[2] H.-H. Nagel, From image sequences towards conceptual descriptions, Image and Vision Computing, 6(2):59-74, 1988.
[3] T. Huang, D. Koller, J. Malik, G. Ogasawara, B. Rao, S. Russell and J. Weber, Automatic Symbolic Traffic Scene Analysis Using Belief Networks, in Proceedings of the 12th National Conference on Artificial Intelligence, pages 966-972, 1994.
[4] H. Buxton and S. Gong, Visual surveillance in a dynamic and uncertain world, Artificial Intelligence, 78(1-2):431-459, 1995.
[5] H. Buxton and R. Howarth, Situational description from image sequences, in AAAI Workshop on Integration of Natural Language and Vision Processing, 1994.
[6] A. F. Bobick, Computers seeing action, in Proceedings of the British Machine Vision Conference, volume 1, pages 13-22, 1996.
[7] P. Remagnino, A. Baumberg, T. Grove, T. Tan, D. Hogg, K. Baker and A. Worrall, An integrated traffic and pedestrian model-based vision system, in Proceedings of the British Machine Vision Conference, pages 380-389, 1997.
[8] P. Remagnino, T. Tan and K. Baker, Agent Orientated Annotation in Model Based Visual Surveillance, in Proceedings of the International Conference on Computer Vision, Bombay, India, pages 857-862, 1998.
[9] M. Bogaert, N. Chleq, P. Cornez, C. S. Regazzoni, A. Teschioni and M. Thonnat, The PASSWORD project, in Proceedings of the International Conference on Image Processing, pages 675-678, 1996.
[10] M. Haag and H.-H. Nagel, Incremental Recognition of Traffic Sequences, in Proceedings of the Workshop on Conceptual Description of Images, pages 1-20, 1998.
[11] M. Brand, The inverse Hollywood problem: from video to scripts and storyboards via causal analysis, in Proceedings of the American Association of Artificial Intelligence, Providence, RI, 1997.
[12] M. Brand, N. Oliver and A. Pentland, Coupled hidden Markov models for complex action recognition, in Proceedings of Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 1997.
[13] R. J. Morris and D. C. Hogg, Statistical models of object interaction, in Proceedings of the IEEE Workshop on Visual Surveillance, Bombay, pages 81-85, 1998.
[14] R. Fraile and S. J. Maybank, Vehicle Trajectory Approximation and Classification, submitted to the British Machine Vision Conference, 1998.
[15] L. R. Rabiner and B. H. Juang, An introduction to hidden Markov models, IEEE ASSP Magazine, pages 4-16, January 1986.
[16] N. Johnson and D. C. Hogg, Learning the distribution of object trajectories for event recognition, Image and Vision Computing, 14(8):609-615, August 1996.
[17] G. D. Sullivan, Model-based vision for traffic scenes using the ground plane constraint, in D. Terzopoulos and C. Brown (eds), Real-time Computer Vision, in press.