

Oct 18, 2013


Research Plan

Human-Computer Interaction


Michael Van den Bergh

Department of Electrical Engineering


Prof. Dr. Luc Van Gool

Computer Vision Laboratory, BIWI

Department of Electrical Engineering, ETH

1. Research Topic

The topic of this work is the visual interaction between a human and a computer or machine, through posture, hand gestures, pointing gestures and other visual cues. This can be either a standalone system, or an additional system to make another (e.g. speech-based) system more robust. The system is designed to be unobtrusive (marker-less) and fast enough (real-time) to provide a smooth user interaction.

Such systems are useful as computer/multimedia systems are moving away from traditional keypad/mouse solutions towards more natural gestures (e.g. scroll wheels, touch interfaces, iDrive, …). Additionally, real-time analysis of human body pose or hand gestures makes it possible to develop new, more natural interfaces.

The real-time nature of such systems shifts the focus away from complex trackers that rely on heavy iterative optimizations, as they require too much computation time. The focus of this work will therefore span very fast, simple trackers, which are model-based approaches, on the one hand, and example-based methods on the other hand. The difference between model-based and example-based approaches is that in example-based approaches the current pose or gesture is detected from a database of preset poses or gestures. The benefit of such an approach is that the system can be much faster, albeit limited to a predefined set of poses.
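The example-based idea can be sketched as a nearest-neighbour lookup against a database of preset poses. The database, labels and feature vectors below are purely illustrative (the actual features used in this work are Haarlet coefficients):

```python
import numpy as np

# Hypothetical pose database: each row is the feature vector of a preset pose.
pose_db = np.array([
    [0.0, 0.0, 1.0],   # "arms down"
    [1.0, 0.0, 0.0],   # "arms up"
    [0.0, 1.0, 0.0],   # "pointing"
])
pose_labels = ["arms down", "arms up", "pointing"]

def classify_pose(features):
    """Return the label of the closest database pose (Euclidean distance)."""
    dists = np.linalg.norm(pose_db - features, axis=1)
    return pose_labels[int(np.argmin(dists))]

print(classify_pose(np.array([0.9, 0.1, 0.0])))  # closest to "arms up"
```

No iterative optimization is involved, which is why such a lookup can run at frame rate; the price is that only the predefined poses can ever be reported.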

2. Research Goals

To carry out this research, we have to fulfill the following goals:

Segmentation: the entire system relies on the segmentation of the user and/or his body parts. The better the segmentation, the easier the following tasks become. However, segmentation can never be perfect, hence the other components in the system will have to be robust to noise and reflections. The first type of segmentation that will be implemented in this work is foreground-background segmentation [2], which allows us to distinguish the user from everything else in the 'background' of the scene. This segmentation is based on color, and made robust to shadows and reflections. The second type of segmentation is skin color segmentation [3]. A skin model allows us to segment skin-colored pixels and thus body parts such as the face and hands of the user. The face can also be detected using the Viola-Jones face detector [4]. The goal of these segmentation stages is to have a segmented silhouette of the user in each camera, a segmented silhouette of the hands, and an approximate location of the face/head.
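As a minimal sketch of the skin-segmentation step, a per-pixel colour rule can stand in for the trained skin model of [3]. The thresholds below are a generic RGB heuristic, not the model actually used in this work:

```python
import numpy as np

def skin_mask(img):
    """Boolean mask of skin-coloured pixels in an RGB image.

    A simple hand-tuned RGB rule; a stand-in for a trained skin model."""
    r = img[..., 0].astype(int)
    g = img[..., 1].astype(int)
    b = img[..., 2].astype(int)
    return (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b) & (np.abs(r - g) > 15)

img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = (200, 120, 90)   # skin-like pixel
img[1, 1] = (30, 80, 200)    # background (blue) pixel
mask = skin_mask(img)        # True only at (0, 0)
```

A real skin model would be trained from labelled pixels and updated per user and per lighting condition; the structure (a cheap per-pixel test producing a binary mask) stays the same.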

3D Reconstruction: given silhouettes of the user taken from several cameras surrounding the user, 3D voxel reconstruction can provide us with a 3D voxel hull of the user, which is very useful for detecting the pose of the user. Our real-time approach will use a voxel-carving technique [5]. A lookup table (LUT) provides a reference between 2D pixel locations in the camera views and 3D voxel projections in the voxel space from which the hull is carved. A major benefit of using 3D voxel hulls is that they can be normalized for rotation, allowing for an orientation-invariant pose detector.
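The carving principle can be shown in a toy form: a voxel survives only if its projection falls inside the silhouette in every camera. The orthographic projections below (dropping one axis) are an assumption for brevity; the real system uses calibrated projection matrices:

```python
import numpy as np

def carve(voxels, silhouettes, project_fns):
    """Keep only voxels whose projection is foreground in every view."""
    hull = []
    for v in voxels:
        if all(sil[project_fns[i](v)] for i, sil in enumerate(silhouettes)):
            hull.append(v)
    return hull

# Two toy binary silhouettes on a 4x4 image plane.
sil_top = np.zeros((4, 4), bool); sil_top[1, 1] = True    # top view: (x, y)
sil_front = np.zeros((4, 4), bool); sil_front[1, :] = True  # front view: (x, z)

voxels = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
hull = carve(voxels, [sil_top, sil_front],
             [lambda v: (v[0], v[1]), lambda v: (v[0], v[2])])
# hull = the column of voxels at x=1, y=1, all z
```

The hull is the intersection of the silhouette cones, so adding cameras can only remove voxels, never add them.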

Tracking: one of the most fundamental forms of tracking is tracking the position and orientation of the user from an overhead camera. Limiting the tracking to the overhead view allows for a less complex body model, and thus a real-time tracker.

Pose/Gesture Recognition: poses and gestures are detected using example-based approaches, rather than full-body tracking or articulated hand tracking. Classification is based on either a monocular 2D silhouette, 2D silhouettes from multiple cameras, or 3D voxel hulls of the user (full body pose) or the hand (hand gesture). These silhouettes or hulls are classified using 2D or 3D Haarlets, which allow for very short computation times [4]. Additional algorithms are also explored, for example counting the number of extended fingers, or interpreting pointing gestures / pointing direction.
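Haarlets owe their speed to the integral image: once the cumulative sums are precomputed, any rectangle sum (and hence any rectangular-wavelet response) costs four lookups. A minimal sketch, with a hypothetical two-rectangle feature:

```python
import numpy as np

def integral(img):
    """Summed-area table with a zero row/column prepended."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1), from four integral-image lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def haar_two_rect(ii, r0, c0, h, w):
    """A 2-rectangle Haar-like feature: left half minus right half."""
    return (box_sum(ii, r0, c0, r0 + h, c0 + w // 2)
            - box_sum(ii, r0, c0 + w // 2, r0 + h, c0 + w))

img = np.zeros((4, 4))
img[:, :2] = 1.0               # bright left half, dark right half
ii = integral(img)
f = haar_two_rect(ii, 0, 0, 4, 4)   # responds strongly to the edge: 8.0
```

The same four-lookup trick extends to 3D Haarlets on voxel grids, with cumulative sums along all three axes and eight lookups per box.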

Haarlet Training: the classification is based on Haarlets [4], which are rectangular wavelets that require very little computation time and therefore open up many real-time opportunities. Next to the usual 2D Haarlets, this work introduces their 3D counterparts. The standard AdaBoost training algorithm for Haarlets is replaced by a Linear Discriminant Analysis (LDA) [7] based training procedure. The main benefits of LDA are that it allows efficient training for classification problems with more than 2 classes, and more relaxed memory/CPU constraints, making it possible to train 3D Haarlets. Additionally, more advanced variations on LDA are explored, such as Average Neighborhood Margin Maximization (ANMM) [8].
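The LDA step itself is compact: it finds the directions that maximize between-class scatter relative to within-class scatter. A minimal sketch (the toy two-class data below is made up; the small ridge added to the within-class scatter is a numerical convenience, not part of the method):

```python
import numpy as np

def lda(X, y, n_components=1):
    """Fisher LDA on data X (n_samples, n_features) with integer labels y.

    Returns the n_components most discriminating directions."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # Eigenvectors of Sw^{-1} Sb; a tiny ridge keeps Sw invertible.
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(len(Sw)), Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:n_components]]

# Two pose classes separated along the first feature axis.
X = np.array([[0.0, 0.0], [0.2, 1.0], [0.1, 2.0],
              [3.0, 0.0], [3.2, 1.0], [3.1, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
W = lda(X, y)   # dominant direction lies close to the first axis
```

Once W is found, the Haarlet selection step approximates these LDA directions with rectangular wavelets, so that projecting a silhouette or hull onto W costs only a handful of box sums.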

Motion Graphs: once the feature coefficients are computed, there are several approaches to determining the pose or gesture based on those coefficients. The most straightforward approach is nearest neighbors classification. However, it is interesting to have a more intelligent classification approach, and also make it possible to analyze the dynamics of the pose changing over time. This can be achieved with motion graphs.
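A minimal illustration of the motion-graph idea: per-frame classifications are filtered through a graph of allowed pose transitions, so physically impossible jumps are rejected. The poses and edges below are invented for the example:

```python
# Hypothetical motion graph: keys are poses, values are reachable poses.
graph = {
    "stand": {"stand", "walk"},
    "walk":  {"walk", "stand", "run"},
    "run":   {"run", "walk"},
}

def filter_sequence(raw, start="stand"):
    """Accept a raw per-frame classification only if the graph allows the
    transition from the previous pose; otherwise keep the previous pose."""
    out, cur = [], start
    for pose in raw:
        if pose in graph[cur]:
            cur = pose
        out.append(cur)
    return out

# A single misclassified frame ("run" straight from "stand") is suppressed.
seq = filter_sequence(["stand", "run", "walk", "run"])
```

Beyond this hard filtering, the same graph structure supports recognizing dynamic gestures as paths through the graph, which is the use foreseen in WP7.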

3. Proposed Strategy

The proposed strategy is divided into the following prototype systems, which will be implemented and tested:

a. Overhead Tracker

The goal is to track the position and orientation of the user. The user can be segmented from the floor and background using foreground-background segmentation (WP2), which can be used to provide a rough estimate of the position of the person. However, a tracker based on particle filtering (WP1) is implemented to provide a more accurate estimate of the user's position and also the orientation. The model for the tracker consists of an ellipse, which represents the shoulders of the user, a circle representing the head, and two smaller ellipses representing the arms of the user.

b. Pose Recognition

Two approaches for pose recognition will be tested. Both systems use several cameras placed around the user in order to detect the pose. Both systems rely on foreground-background segmentation (WP2) to extract silhouettes of the user. The position and size of these silhouettes are normalized. The first system will classify pose directly on the silhouettes. Therefore a classifier based on 2D Haarlets is trained (WP4), and using these Haarlets, the pose can be estimated (WP5) and interpreted in the system (WP7).

A second pose recognition system will use the extracted silhouettes to run a 3D voxel reconstruction (WP3), resulting in a 3D voxel hull of the user. Classification is based on these hulls. A classifier based on 3D Haarlets is trained (WP4) and using these Haarlets, the pose can be estimated and interpreted in the system (WP7). The benefit of this 3D approach is that the voxel hulls not only can be normalized for size and position, but also for rotation, allowing for the implementation of an orientation-invariant pose detector.

In order to determine this orientation, either eigen-analysis can be used, or the orientation can be retrieved from the overhead tracker (WP1).

c. Perceptive User Interface

This system consists of a user interface where the user interacts with a large screen using pointing gestures and hand gestures. The user is observed with 2 or more cameras. Using foreground-background segmentation and skin color segmentation (WP2), the location and shape of the hands and face can be determined. By detecting the eyes and fingertip, we can retrieve the pointing direction of the user. Gesture detection (WP6) is performed on the hand to allow for interaction with the objects displayed on the screen.
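The geometry of the pointing step can be sketched as intersecting the eye-to-fingertip ray with the screen plane. The coordinate frame (screen at z = 0) and the example points are assumptions for illustration:

```python
import numpy as np

def pointing_target(eye, fingertip, screen_z=0.0):
    """Intersect the eye->fingertip ray with the screen plane z = screen_z.

    Both points are 3D coordinates in a common world frame (an assumption;
    in practice they come from triangulating the camera views)."""
    eye = np.asarray(eye, float)
    tip = np.asarray(fingertip, float)
    d = tip - eye                           # ray direction
    t = (screen_z - eye[2]) / d[2]          # ray parameter at the plane
    return eye + t * d                      # 3D point on the screen

# Eye at head height, fingertip slightly lower and closer to the screen.
target = pointing_target(eye=(0.0, 1.6, 2.0), fingertip=(0.1, 1.4, 1.5))
# target = (0.4, 0.8, 0.0): the indicated point on the screen plane
```

Using the eye-fingertip line (rather than the forearm direction) matches how people naturally aim at on-screen targets.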

4. Workplan

The work will be organized in a number of Work Packages:

WP1. Tracking

The goal is to develop an overhead tracker to track the position and orientation of a user. A tracker will be extended from an elliptical face tracker based on particle filtering [1]. The model will be extended to also track rotation, and it will be refined with head and arm parts to improve the accuracy of the tracker.
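One predict/weight/resample cycle of such a particle filter can be sketched as follows for an (x, y, theta) state. The Gaussian likelihood and the constants below are placeholders; in the real tracker the likelihood comes from fitting the ellipse/circle body model to the overhead image:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, measure, motion_noise=0.1):
    """One cycle: diffuse particles, weight by likelihood, resample.

    `measure` maps a particle to a likelihood (a stand-in for the
    ellipse-model fit against the overhead camera image)."""
    particles = particles + rng.normal(0, motion_noise, particles.shape)
    weights = np.array([measure(p) for p in particles])
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# Toy setup: the "true" user state, and a likelihood peaked around it.
true_state = np.array([2.0, 3.0, 0.5])
measure = lambda p: np.exp(-np.sum((p - true_state) ** 2))

particles = rng.uniform(0, 5, (500, 3))
weights = np.full(500, 1 / 500)
for _ in range(20):
    particles, weights = particle_filter_step(particles, weights, measure)
estimate = particles.mean(axis=0)   # converges near true_state
```

Because each particle is evaluated independently against the image, the per-frame cost is bounded by the particle count, which is what keeps this tracker real-time.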

WP2. Segmentation

In order to do further processing, we will want to segment the user from the scene, and segment interesting body parts such as the hands of the user. Firstly, a foreground-background segmentation algorithm will be implemented based on [2]. Secondly, a skin color segmentation system is implemented which allows us to detect skin-colored body parts such as the hands and face [3]. The face can also be detected using the Viola-Jones face detector [4]. The goal is to perform the segmentation on a separate computer for each camera and then forward the result to the main system.
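The foreground-background stage can be illustrated by a per-pixel colour distance against a static background image. This is a deliberate simplification of the colinearity criterion of [2], which additionally tolerates the brightness changes caused by shadows:

```python
import numpy as np

def foreground_mask(frame, background, thresh=30.0):
    """Per-pixel foreground test by Euclidean colour distance.

    A simplified stand-in for the extended colinearity criterion of [2]."""
    diff = frame.astype(float) - background.astype(float)
    return np.linalg.norm(diff, axis=-1) > thresh

bg = np.full((3, 3, 3), 100, dtype=np.uint8)   # flat grey background
frame = bg.copy()
frame[1, 1] = (200, 50, 50)                    # one "user" pixel
mask = foreground_mask(frame, bg)              # True only at (1, 1)
```

Because the test is independent per pixel, it parallelizes trivially, which is what makes the per-camera, per-computer split described above practical.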

WP3. 3D Reconstruction

To extract the 3D voxel hulls of the user, real-time 3D voxel reconstruction is implemented based on [5]. The reconstruction is based on a voxel-carving technique, and speed is ensured by using a lookup table (LUT) that links each pixel in the camera image to a ray of voxels to be carved out of the voxel space. This reconstruction algorithm will be integrated into the main system, so that segmented images of the users can be reconstructed to a voxel hull at real-time rates.
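The LUT idea can be sketched in miniature: precompute, for every camera pixel, the list of voxels projecting onto it; at run time, carving is just a walk over the lists of background pixels. The orthographic projection and grid size below are toy assumptions standing in for the calibrated cameras:

```python
import numpy as np

GRID = 4

def project(v):
    """Toy orthographic projection: voxel (x, y, z) -> pixel (x, y)."""
    return (v[0], v[1])

# Precompute the LUT once per camera calibration.
lut = {}
for v in [(x, y, z) for x in range(GRID)
          for y in range(GRID) for z in range(GRID)]:
    lut.setdefault(project(v), []).append(v)

# Per frame: remove every voxel ray behind a background pixel.
silhouette = np.zeros((GRID, GRID), bool)
silhouette[1, 1] = True                        # one foreground pixel
full = set().union(*lut.values())
carved = full - set().union(*(lut[px] for px in lut if not silhouette[px]))
# carved = the 4 voxels on the ray behind pixel (1, 1)
```

All projection arithmetic is paid once, offline; the per-frame cost is proportional to the number of background pixels, which is what makes real-time rates achievable.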

WP4. Haarlet Training / Classification

In a first step, alternative approaches to the AdaBoost training process are explored. The main limitations of AdaBoost are that it is designed for 2-class problems and it has very high memory/CPU requirements, which make it unusable for training 3D Haarlets [6]. We will need 3D Haarlets if we want to classify based on 3D voxel hulls. A training algorithm will be developed and implemented based on Linear Discriminant Analysis (LDA). LDA provides the most discriminating transformation to classify between the different pose classes. Subsequently, this LDA transformation can be approximated using Haarlets, which is the basis of our Haarlet selection algorithm. LDA also allows us to extend the approach to 3D Haarlets, in order to classify 3D voxel hulls. Using 3D voxel hulls, the orientation of the user can be normalized, which is an important benefit. In further steps, similar algorithms to LDA with different benefits will be explored, such as Average Neighborhood Margin Maximization (ANMM).

WP5. Pose recognition

After implementing the previous 4 work packages, the work can focus on recognizing what is going on in the scene. In a first step, this will be interpreting the (static) coefficients resulting from the LDA and the Haarlets, for example using nearest neighbors. However, different approaches will be explored, for example to benefit from knowing which coefficients or poses were detected during the previous frames. This can be achieved using motion graphs, or borrowing techniques from speech recognition. In a next step, dynamic pose motions can be analyzed, in order to detect dynamic gestures.

WP6. Hand gesture recognition

This work package is quite similar to WP5, but instead of analyzing the whole human body, only the hand is considered. It will be explored how hand gestures can be detected from a monocular camera, from multiple 2D camera views, or from a reconstructed hull of the hand. In a first step, the gesture will be detected using distance measures directly on the contour of the segmented hand, such as the Hausdorff distance [3]. Subsequently, similar to the pose recognition system, features will be trained to build a classifier, which achieves faster and more accurate results. This work package is also interesting for collaborations with other researchers.
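The Hausdorff distance between two contours is the largest distance from any point on one contour to its nearest point on the other, taken symmetrically. A small sketch on made-up contour point sets (real contours would have hundreds of points extracted from the segmented hand):

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two (n, 2) contour point sets."""
    # Pairwise distances between all points of A and all points of B.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # Worst-case nearest-neighbour distance, in both directions.
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Toy "contours": an extended finger makes the open hand differ from a fist.
open_hand = np.array([[0, 0], [1, 0], [2, 0], [2, 2]], float)
fist      = np.array([[0, 0], [1, 0], [2, 0]], float)
d = hausdorff(open_hand, fist)   # dominated by the fingertip point: 2.0
```

A gesture is then recognized by finding the template contour with the smallest Hausdorff distance to the observed one; the later feature-based classifier replaces this comparison with a much cheaper projection.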

WP7. Motion graphs

Motion graphs will be explored to improve the robustness of the detection, both of human pose and hand gestures, and to detect dynamic gestures, which consist of a sequence of static hand or body poses.

5. Timeline


6. References


[1] Katja Nummiaro, Esther Koller-Meier and Luc Van Gool. An Adaptive Color-Based Particle Filter.

[2] Andreas Griesser, Stefaan De Roeck, Alexander Neubeck and Luc Van Gool. GPU-Based Foreground-Background Segmentation using an Extended Colinearity Criterion. Vision, Modeling, and Visualization (VMV), 2005.

[3] Michael Van den Bergh, Ward Servaes, Geert Caenen, Stefaan De Roeck, Luc J. Van Gool. Perceptive User Interface, a Generic Approach. HCI 2005: 60.

[4] P. Viola, M. J. Jones. Robust Real-time Object Detection. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:747, 2001.

[5] R. Kehl, M. Bray and L. Van Gool. Full Body Tracking from Multiple Views Using Stochastic Sampling. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2005.

[6] Y. Ke, R. Sukthankar, and M. Hebert. Efficient Visual Event Detection using Volumetric Features. International Conference on Computer Vision, 166-173, 2005.

[7] K. Fukunaga. Introduction to Statistical Pattern Recognition (Second Edition). New York: Academic Press, 1990.

[8] Fei Wang, Changshui Zhang. Feature Extraction by Maximizing the Average Neighborhood Margin. IEEE Conference on Computer Vision and Pattern Recognition 2007 (CVPR'07).