Automatic Visual Surveillance of Vehicles and People


P.Remagnino, S.Maybank, R. Fraile, K. Baker

Department of Computer Science, The University of Reading, Reading, RG6 6AY, UK


School of Computer Studies, University of Leeds, Leeds, LS2 9

Key words:

visual surveillance, scene interpretation, learning.


This paper presents three separate techniques to interpret scene dynamics.
Vehicle trajectories, pedestrian behaviours and the interactions between vehicles and pedestrians are analysed using probabilistic frameworks. The input is provided by an integrated vision system which tracks vehicles and pedestrians in complex scenarios.


The last decade has seen a large increase in the use of visual surveillance systems. These are often installed in concourses, car park areas and high security sites to monitor the flow of pedestrians and vehicles for security and data analysis. The job of monitoring image sequences is usually assigned to a human operator who waits for important events to occur. Operators rapidly become bored and lose concentration. It is therefore essential to devise autonomous surveillance systems which can interpret the images and alert a human operator only when suspicious events occur.

This paper describes three techniques which interpret the output of an integrated vision system built for locating and tracking pedestrians and vehicles in complex scenes. Individual trajectories of people and vehicles are analysed and interpreted. Vehicle trajectories are analysed and predicted using hidden Markov models. Human trajectories are classified as standard or atypical using a supervised learning technique. The vehicle and human trajectories are also interpreted by Bayesian networks, and verbal descriptions of their motion dynamics and their trajectory trends are produced. These descriptions allow a human observer to interpret the scene. The vision system lends itself as a powerful tool for interpreting relevant scene events and identifying anomalous situations.


Over the last ten years vision researchers have provided a number of solutions to the visual surveillance problem. Nagel's paper on semantic interpretation is a seminal work on the subject. Buxton and Howarth, and Buxton and Gong, introduced Bayesian networks, based on the work of Pearl, to detect the foci of attention of a dynamic scene and to provide interpretations of traffic situations. Huang et al. proposed a similar probabilistic approach for the automatic visual surveillance of Californian highways. Bogaert et al. have worked on the surveillance of subways and metropolitan stations to monitor vandalism. More recently Haag proposed a temporal fuzzy logic to provide high level verbal descriptions for traffic scenes. Bobick proposed a novel approach for the description of human activities. Brand et al. described a coupled hidden Markov model to interpret human interactions and perform visual surveillance.

The vision system presented here is unique in the literature as it builds and automatically updates a 3D model of the imaged dynamic scene. Vehicle trajectories are analysed by a novel approach which learns and predicts the global trends in terms of local curve measurements. Pedestrian behaviour is interpreted by a supervised learning technique which is complementary to that of Johnson. Object dynamics and their interactions are analysed by Bayesian networks using a variation of the approach of Buxton and Gong. Overall the system offers an integrated suite of visual modules. It is richer than any other in the literature and it can automatically produce interpretations from a sequence of images.


The vision system integrates two independently developed software modules, one for tracking vehicles and the other for tracking pedestrians. The system assumes a pre-calibration of a static camera which yields a global coordinate system on the ground plane. Regions of relevant motion are detected and assigned to a software module according to their elongation (horizontally elongated shapes to the vehicle tracker, vertically elongated shapes to the pedestrian tracker). The vehicle tracker instantiates 3D wire-frame models on the ground plane, while the pedestrian tracker makes use of deformable two-dimensional shapes in the image. These shapes are back-projected onto the 3D world assuming a standard person height, and a cylinder is instantiated for each pedestrian as part of the 3D scene model. The 3D model is kept up to date by both systems. Occlusions are handled by making use of the scene geometry and the position of the camera. A 2D depth map is used by the image processing routines to deal with occlusions.


The vehicle tracker obtains from each image frame a measurement of the position of the vehicle on the ground plane. Each measurement is of the ground-plane position directly below the mid-point of the rear axle. A typical image sequence of a moving vehicle yields 30 to 40 measurements, with one measurement taken every fourth frame. The time between consecutive measurements is 0.16 s.

The sequence of measurements is divided into overlapping segments, each containing 10 measurements, with adjacent segments having 9 measurements in common. The measurements in each segment are approximated by a continuous low-curvature function. The low curvature ensures that the steering angles associated with the continuous approximation are physically realistic. Steering angles are usually low; even at full lock the steering angle is still small. Let the measurements in the segment of interest be x_1, ..., x_10, and let t_i be the time of the frame from which x_i is obtained. The approximating function f is found in two steps. The first step is to find the degree-two polynomial function p which minimises

    \sum_{i=1}^{10} |p(t_i) - x_i|^2.    (1)

The second step is to obtain a low-curvature approximation f to p by minimising the functional

    E(f) = \sum_{i=1}^{10} |f(t_i) - p(t_i)|^2 + \lambda \int \kappa_f(t)^2 dt,    (2)

where \kappa_f denotes the curvature of f and \lambda is a constant whose value was determined experimentally to give good results. The first term on the right-hand side of (2) ensures that f is close to p, and the second term ensures that f has a low curvature.
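As a rough sketch of this two-step fit, the snippet below replaces the curvature integral in (2) with a sum of squared second differences, a common discrete surrogate. The weight lam stands in for the paper's experimentally chosen constant, whose value is not recoverable here, and the names t, x, p and f are our own.

```python
import numpy as np

def smooth_segment(t, x, lam=1.0):
    """Two-step low-curvature approximation of one 10-point segment.

    t : (10,) sample times; x : (10, 2) ground-plane measurements.
    lam is a stand-in for the paper's experimentally chosen constant.
    """
    # Step 1: degree-two polynomial p minimising sum |p(t_i) - x_i|^2,
    # fitted independently for each ground-plane coordinate.
    p = np.stack([np.polyval(np.polyfit(t, x[:, k], 2), t)
                  for k in range(x.shape[1])], axis=1)

    # Step 2: discrete analogue of the functional (2): find f minimising
    # sum |f(t_i) - p(t_i)|^2 + lam * sum (second difference of f)^2.
    n = len(t)
    D2 = np.zeros((n - 2, n))
    for i in range(n - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]      # second-difference stencil
    A = np.eye(n) + lam * D2.T @ D2            # normal equations of the fit
    f = np.linalg.solve(A, p)                  # solved per coordinate
    return p, f
```

For a straight-line trajectory the curvature penalty vanishes and the smoothed output reproduces the measurements, which is a quick sanity check of the construction.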

The least speed and the greatest steering angle of the car over the time span of the segment are estimated from the approximating function, and the segment is assigned to one of the four classes (ahead, left, right, stop) using Table 1.


Table 1. Relation between classes and conditions.







The trajectory is thus reduced to a string of symbols drawn from {a, l, r, s}, one symbol for each segment.
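Since the body of Table 1 did not survive, the mapping from segment measurements to symbols can only be sketched with assumed thresholds; the cut-off values and the sign convention below are illustrative, not the authors':

```python
# Hypothetical reconstruction of Table 1: the published conditions are not
# available, so these cut-offs are illustrative only.
SPEED_MIN = 0.5       # m/s below which the car counts as stopped (assumed)
STEER_MAX = 0.1       # rad below which the car counts as going ahead (assumed)

def classify_segment(least_speed, greatest_steering_angle):
    """Map one smoothed segment to a symbol in {a, l, r, s}.

    greatest_steering_angle is signed: positive means left (assumed).
    """
    if least_speed < SPEED_MIN:
        return 's'                      # stop
    if abs(greatest_steering_angle) < STEER_MAX:
        return 'a'                      # ahead
    return 'l' if greatest_steering_angle > 0 else 'r'
```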

The string is edited using a hidden Markov model for the motion of the car. The model has internal states {A, L, R, S}, again corresponding to ahead, left, right, stop. The states are regarded as the true states of the car, while the symbols are observations which may be in error.

The Viterbi algorithm is used to find the string of states for which the observed string has the highest probability. The transition probabilities for the HMM were assigned `by hand', after the analysis of a set of 21 image sequences of a car manoeuvring in a car park.
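A minimal version of this decoding step can be sketched as follows; the transition, emission and start probabilities are invented stand-ins for the hand-assigned values, which are not published in the text:

```python
import numpy as np

STATES = ['A', 'L', 'R', 'S']           # true motion states
SYMBOLS = ['a', 'l', 'r', 's']          # observed, possibly erroneous

# Stand-in probabilities (the authors' hand-assigned values are unpublished):
# strong self-transition bias, and symbols that usually match the state.
TRANS = np.array([[0.90, 0.04, 0.04, 0.02],
                  [0.10, 0.85, 0.01, 0.04],
                  [0.10, 0.01, 0.85, 0.04],
                  [0.20, 0.05, 0.05, 0.70]])
EMIT = np.array([[0.85, 0.05, 0.05, 0.05],
                 [0.05, 0.85, 0.05, 0.05],
                 [0.05, 0.05, 0.85, 0.05],
                 [0.05, 0.05, 0.05, 0.85]])
START = np.array([0.7, 0.1, 0.1, 0.1])

def viterbi(symbols):
    """Most probable state string for an observed symbol string."""
    obs = [SYMBOLS.index(s) for s in symbols]
    n, k = len(obs), len(STATES)
    logp = np.full((n, k), -np.inf)     # best log-probability per state
    back = np.zeros((n, k), dtype=int)  # back-pointers for the best path
    logp[0] = np.log(START) + np.log(EMIT[:, obs[0]])
    for t in range(1, n):
        for j in range(k):
            scores = logp[t - 1] + np.log(TRANS[:, j])
            back[t, j] = int(np.argmax(scores))
            logp[t, j] = scores[back[t, j]] + np.log(EMIT[j, obs[t]])
    path = [int(np.argmax(logp[-1]))]
    for t in range(n - 1, 0, -1):       # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return [STATES[i] for i in reversed(path)]
```

With self-transitions this strong, an isolated spurious symbol is absorbed into the surrounding state, which is exactly the editing behaviour described above.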


The figure shows a trajectory from one of the experiments. The vehicle moves from top left to bottom right. The arrows represent the normal to the trajectory pointing to the right-hand side of the driver. The base points of the arrows are the coordinates of the vehicle at each sample. The sequence of extracted symbols is {a, a, a, a, a, a, a, a, l, l, l, l, l, a, a, a, l, l, a, a, l, l, l} and the most likely sequence of states, as identified by the Viterbi algorithm, is {A, A, A, A, A, A, A, A, L, L, L, L, L, L, L, L, L, L, L, L, L, L, L}.

Pedestrian behaviour is analysed by building a statistical model of the trajectory. The aim here is to obtain high-level descriptions of the behaviour of the entire trajectory, classifying each trajectory as typical or atypical. In particular we are interested in the behaviour which occurs in car park situations. The model describes the instantaneous behaviour of the person relative to an individual vehicle, and the combinations of interactions with several vehicles. A few points on the trajectory are chosen as salient landmarks.

A principal design goal has been to construct a system which can describe a wide variety of different behaviours. The geometry of the scene changes over time as cars leave and enter the car park. People can take many possible routes through the car park, weaving between vehicles. Individual trajectories will differ because a person is likely to start or finish the journey at his or her particular car. This precludes the use of techniques which compare a whole path with paths which have occurred many times before.

On each trajectory of a person, the points which are closest to each vehicle in the scene are selected as landmarks. Landmarks can be computed in different ways:

1. For each vehicle the closest point on the trajectory to that object is found. This yields one landmark for each object.

2. All local minima of the functions d_i(t) are used, where d_i(t) is the distance to vehicle i at time t.

3. Combining methods 1 and 2, that is, finding the global minimum of the distance to each vehicle, and rejecting those landmarks for which some other object is closer.
In method 1 some of the minima will correspond to vehicles which are far away, and of little interest, but which still affect the characteristics of later distributions, in particular making them sensitive to the number of cars in the scene. In the second method measurement noise is a major problem. If the speed is low there may be several local minima for each object. Some of these minima can be eliminated by smoothing the curve, but this is still likely to leave more than one minimum per object. This may be useful for capturing the length of time two objects are close or for detecting repeated interactions. The third method was chosen because it combines the good features of the first two.
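The third method can be sketched directly from this description; the function name and the tuple layout below are our own choices:

```python
import numpy as np

def landmarks(traj, vehicles):
    """Method 3 landmark selection (sketch).

    traj : (T, 2) person positions on the ground plane.
    vehicles : (V, 2) vehicle positions.
    For each vehicle, take the trajectory point globally closest to it,
    keeping it only if no other vehicle is closer at that point.
    Returns a list of (time index, vehicle index, distance) tuples.
    """
    # dists[t, v] = distance from trajectory point t to vehicle v
    dists = np.linalg.norm(traj[:, None, :] - vehicles[None, :, :], axis=2)
    marks = []
    for v in range(len(vehicles)):
        t = int(np.argmin(dists[:, v]))          # global minimum for vehicle v
        if int(np.argmin(dists[t])) == v:        # reject if another is closer
            marks.append((t, v, float(dists[t, v])))
    return marks
```

A distant vehicle whose closest trajectory point lies nearer to some other vehicle is discarded, which is how the method avoids the sensitivity to scene population noted for method 1.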

The identified landmarks are assigned values which correspond to the speed of the person and the distance between the person and the vehicle. In current work, principal component analysis is being used to obtain further characterisations of the local shape of the trajectory.

The landmark values are used to build a statistical distribution. This is achieved by taking all the landmarks on all the trajectories of a training set, and calculating their speeds and distances.

These two quantities have definite orderings, i.e. low speed is more noteworthy than high speed, and low distance is more noteworthy than high distance. So the probability for a landmark with speed v and distance d is calculated by simply counting the number of training points with speed at most v and distance at most d, and dividing by the total number of points.
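This counting rule is essentially an empirical cumulative distribution over the training landmarks, and can be sketched as:

```python
def landmark_probability(speed, dist, training):
    """Probability of a landmark by counting, as described in the text.

    training is a list of (speed, distance) pairs from all landmarks on
    all training trajectories. Low speed and low distance are the
    noteworthy directions, so we count training points dominated by
    (speed, dist) and divide by the total count.
    """
    hits = sum(1 for s, d in training if s <= speed and d <= dist)
    return hits / len(training)
```

A landmark where the person is both slow and close to a vehicle therefore receives a low probability, marking it as noteworthy.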


For each trajectory the above procedure gives us a sequence of landmarks, and hence an ordered sequence of probabilities. This sequence is sorted in terms of increasing probabilities. This makes the representation independent of the particular order the events occur in.

(a) typical and (b) atypical trajectories

(a) shows two normal trajectories, and the associated sequences of sorted probabilities. Here the first two values are low, corresponding to when the person stops near their own car. The other values in the sequence rapidly increase. (b) shows more atypical trajectories. Here there are more values with low probabilities.

A supervised learning technique is used to classify the sorted sequences. Data are divided into two groups, a training set and a test set. Each set is further classified by hand into typical and atypical trajectories. The training set consisted of 59 trajectories: 54 typical and 5 atypical. The test set consisted of 70 trajectories: 64 typical and 6 atypical. The weighted sum w_1 p_1 + ... + w_5 p_5 of the first five probabilities p_1, ..., p_5 in each sequence is used. If the sum is greater than 0.5 the trajectory is classified as being atypical. In the training stage an exhaustive search is made over all possible weights, each weight taking values between 0 and 1 with a step increment of 0.2. Those weights which correctly classify the most trajectories (four or fewer misclassifications) are chosen and the mean values of these weights are computed. Trajectories in the test set are finally classified using the calculated weights. In the experiments all six atypical trajectories were correctly classified, as were 60 of the 64 typical trajectories.
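A sketch of this classifier and its training search follows; the weight grid and the greater-than-0.5 convention are taken from the text, while the function names and the handling of an empty candidate set are our own:

```python
import itertools

STEPS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # weight grid from the text

def is_atypical(sorted_probs, w, threshold=0.5):
    """Weighted sum of the five smallest landmark probabilities;
    greater than the threshold means atypical (convention from the text)."""
    return sum(wi * pi for wi, pi in zip(w, sorted_probs[:5])) > threshold

def train_weights(trajectories, labels):
    """Exhaustive search over the 6^5 weight vectors; average the ones
    that misclassify at most four training trajectories."""
    good = []
    for w in itertools.product(STEPS, repeat=5):
        errors = sum(is_atypical(p, w) != y
                     for p, y in zip(trajectories, labels))
        if errors <= 4:
            good.append(w)
    if not good:
        raise ValueError("no weight vector met the error bound")
    return [sum(col) / len(good) for col in zip(*good)]
```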



Each pedestrian and vehicle is assigned a probabilistic agent, called a behaviour agent, which is capable of interpreting its behaviour in terms of a description of motion dynamics and trajectory trends (regularity). When two objects are in close proximity another probabilistic agent, called a situation agent, interprets the interaction.

In the current incarnation, an agent is a Bayesian network with a semantic annotator for interpreting the output of the net. A Bayesian network is a directed acyclic graph, in which the nodes represent clauses or events and the arcs their causal relations. The model was created and developed by Pearl. Conceptually the Bayesian network captures the qualitative and quantitative nature of the underlying problem in a single compact model. The graph topology is a qualitative representation of the causal relationships. The model infers the most likely ‘explanation’ of the observations by propagating evidence from the leaves to the root.

The Bayesian network used by the behaviour agent has two hidden nodes (DYN and TRAJ) which identify an intermediate interpretation of the object behaviour in terms of its dynamics (DYN) and its trajectory (TRAJ). The DYN node tells whether the object is stationary, moving slowly, moving with average speed or moving fast in a particular area of interest, or moving out of it. The links between location (LOC), heading (HD), speed (SPEED) and the dynamics (DYN) define a set of fixed causal links. Each link carries a conditional probability matrix which encodes a priori knowledge about the causal relationship. For instance the object trajectory (TRAJ) is affected by both its acceleration (ACC) and curvature (CURV). While curvature (CURV) simply encodes the regularity of the stretch, the acceleration (ACC) node records whether the object is accelerating, decelerating or travelling at constant speed. The root node represents the behaviour (BEH) or attitude of the object. In essence the characteristics described by the DYN and the TRAJ nodes are merged into a more compact and meaningful description. So, for instance, if the object is a pedestrian and it is moving slowly in a field with a regular trajectory, this will be interpreted as the pedestrian walking on the field. The behaviour agent produces the most probable textual description based on the object class and its behaviour probability vector.
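The final fusion step of the behaviour agent can be sketched with made-up conditional probability tables; the state labels and all numbers below are illustrative assumptions, not the paper's:

```python
import itertools

DYN_STATES = ['stationary', 'slow', 'average', 'fast']
TRAJ_STATES = ['regular', 'irregular']
BEH_STATES = ['waiting', 'walking', 'running', 'erratic']

# P(BEH | DYN, TRAJ), one assumed row per (dyn, traj) combination.
CPT = {
    ('stationary', 'regular'):   [0.90, 0.05, 0.01, 0.04],
    ('stationary', 'irregular'): [0.70, 0.05, 0.01, 0.24],
    ('slow', 'regular'):         [0.10, 0.80, 0.02, 0.08],
    ('slow', 'irregular'):       [0.10, 0.40, 0.05, 0.45],
    ('average', 'regular'):      [0.02, 0.85, 0.08, 0.05],
    ('average', 'irregular'):    [0.02, 0.50, 0.13, 0.35],
    ('fast', 'regular'):         [0.01, 0.10, 0.80, 0.09],
    ('fast', 'irregular'):       [0.01, 0.05, 0.50, 0.44],
}

def behaviour(p_dyn, p_traj, object_class='pedestrian'):
    """Merge the DYN and TRAJ distributions into a behaviour probability
    vector and return it with the most probable textual description."""
    p_beh = [0.0] * len(BEH_STATES)
    for d, t in itertools.product(range(len(DYN_STATES)),
                                  range(len(TRAJ_STATES))):
        row = CPT[(DYN_STATES[d], TRAJ_STATES[t])]
        for b in range(len(BEH_STATES)):
            p_beh[b] += row[b] * p_dyn[d] * p_traj[t]
    label = BEH_STATES[max(range(len(p_beh)), key=p_beh.__getitem__)]
    return p_beh, f'{object_class} is {label}'
```

A slow-moving pedestrian with a regular trajectory thus yields the description "pedestrian is walking", mirroring the example in the text.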


The behaviour agent


The situation agent

The situation agent creates a probabilistic connection between the behaviour agents assigned to two objects in close proximity. The figure shows the network. The situation agent summarises the occurring events in terms of the behaviours of the two objects involved (BEH1 and BEH2) and their directions of motion (DIRs). The behaviour nodes BEH1 and BEH2 are the root nodes of the two behaviour agents. The DIRs node records whether the two objects are heading towards one another, moving away from one another, or moving along non-interfering directions.

The present implementation only deals with pairwise interactions. The authors are currently working on an extension to the system. The idea is to model complex and prolonged situations with many interactions using a Markov model. Models will be learnt off-line using training sets of similar situations, and used on-line to interpret a specific situation.



The figures show two frames of a car park sequence taken with a fixed camera. They show the 3D scene model superimposed onto the image (wire-frames for vehicles, and cylinders for pedestrians). Behaviour agents are assigned to all identified objects in the scene. Agent interpretations appear as text written below each image. The first frame shows the behaviour interpretation for vehicles VEH1 and VEH2 and for pedestrian PED1.

Frame 487


Frame 539


The second frame shows all behaviour interpretations and a situation generated by the close proximity of vehicle VEH2 and pedestrian PED1. Pedestrian PED2 seems close to vehicle VEH1, but this is partly an effect of perspective. The pedestrian is not close enough to trigger the creation of a situation agent (a Euclidean threshold was set to 4 metres).


We have presented an integrated vision system for use in visual surveillance problems. The system uses the output of a vision system to analyse vehicle trajectories using hidden Markov models, to learn the atypical behaviours of pedestrians with a supervised technique, and to interpret the interactions between pedestrians and vehicles using a Bayesian formalism. A brief account of some experimental results was provided. The authors are currently working on an extension of the system to learn more complex situations.


Judea Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.

H.-H. Nagel, From image sequences towards conceptual descriptions, Image and Vision Computing, 6(2):59-74, 1988.

T. Huang, D. Koller, J. Malik, G. Ogasawara, B. Rao, S. Russell and J. Weber, Automatic Symbolic Traffic Scene Analysis Using Belief Networks, in Proceedings of the 12th National Conference on Artificial Intelligence, pages 966-972, 1994.

H. Buxton and S. Gong, Visual surveillance in a dynamic and uncertain world, Artificial Intelligence, 78(1-2):431-459, 1995.

H. Buxton and R. Howarth, Situational description from image sequences, in AAAI Workshop on Integration of Natural Language and Vision Processing, 1994.

A. F. Bobick, Computers seeing action, in Proceedings of the British Machine Vision Conference, volume 1, pages 13-22, 1996.

P. Remagnino, A. Baumberg, T. Grove, T. Tan, D. Hogg, K. Baker and A. Worrall, An integrated traffic and pedestrian model-based vision system, in Proceedings of the British Machine Vision Conference, pages 380-389, 1997.

P. Remagnino, T. Tan and K. Baker, Agent Orientated Annotation in Model Based Visual Surveillance, in Proceedings of the International Conference on Computer Vision, Bombay, India, pages 857-862, 1998.

M. Bogaert, N. Chleq, P. Cornez, C. S. Regazzoni, A. Teschioni and M. Thonnat, The PASSWORD project, in Proceedings of the International Conference on Image Processing, pages 675-678, 1996.

M. Haag and H.-H. Nagel, Incremental Recognition of Traffic Sequences, in Proceedings of the Workshop on Conceptual Description of Images, pages 1-20, 1998.

M. Brand, The inverse Hollywood problem: from video to scripts and storyboards via causal analysis, in Proceedings of the American Association of Artificial Intelligence, Providence, RI, 1997.

M. Brand, N. Oliver and A. Pentland, Coupled hidden Markov models for complex action recognition, in Proceedings of Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 1997.

R. J. Morris and D. C. Hogg, Statistical models of object interaction, in Proceedings of the IEEE Workshop on Visual Surveillance, Bombay, 1998, pp 81.

R. Fraile and S. J. Maybank, Vehicle Trajectory Approximation and Classification, submitted to the British Machine Vision Conference, 1998.

L. R. Rabiner and B. H. Juang, An introduction to hidden Markov models, IEEE ASSP Magazine, January 1986, pp 4-16.

N. Johnson and D. C. Hogg, Learning the distribution of object trajectories for event recognition, Image and Vision Computing, 14(8):609-615, August 1996.

G. D. Sullivan, Model-based vision for traffic scenes using the ground plane constraint, in Terzopoulos and C. Brown (Eds), Real-time Computer Vision, in press.