Real-Time Human Pose Recognition in Parts from Single Depth Images
Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake
Microsoft Research Cambridge & Xbox Incubation
CVPR 2011 Best Paper
OUTLINE
• Introduction
• Data
• Body Part Inference and Joint Proposals
• Experiments
• Discussion
Introduction
• Robust interactive human body tracking
– gaming, human-computer interaction, security,
– telepresence, health-care
• Real-time depth cameras
– some systems track from frame to frame but struggle to re-initialize quickly and so are not robust
– our focus: per-frame initialization, to complement a tracking algorithm
• Focus on pose recognition in parts
– 3D position candidates for each skeletal joint
Introduction
• Appropriate tracking algorithms
– Tracking people with twists and exponential maps (CVPR 1998)
– Tracking loose-limbed people (CVPR 2004)
– Nonlinear body pose estimation from depth images (DAGM 2005)
– Real-time hand-tracking with a color glove (ACM 2009)
– Real time motion capture using a single time-of-flight camera (CVPR 2010)
Introduction
• Inspired by recent object recognition work that divides objects into parts
– Object class recognition by unsupervised scale-invariant learning [CVPR 2003]
– The layout consistent random field for recognizing and segmenting partially occluded objects [CVPR 2006]
• Two key design goals
– Computational efficiency
– Robustness
Introduction
Pipeline: depth image → (segment) → dense probabilistic body part labeling, with parts spatially localized near skeletal joints → (generate) → 3D joint proposals
Introduction
• We treat the segmentation into body parts as a per-pixel classification task
– evaluating each pixel separately
• Training data
– generate realistic synthetic depth images
– train a deep randomized decision forest classifier that avoids overfitting
Introduction
• Avoiding overfitting
• Simple, discriminative depth comparison image features
• Maintaining high computational efficiency
Introduction
• For further speed, the classifier can be run in parallel on each pixel on a GPU
• Mean shift finds the modes of the inferred per-pixel distributions, resulting in the 3D joint proposals
What is Mean Shift?
A tool for: finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N
PDF in feature space
• Color space
• Scale space
• Actually any feature space you can conceive
• …
Data → Non-parametric Density Estimation → Discrete PDF Representation → Non-parametric Density GRADIENT Estimation (Mean Shift) → PDF Analysis
Intuitive Description
Distribution of identical billiard balls
Objective: Find the densest region
Figure labels: region of interest, center of mass, mean shift vector
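The billiard-ball picture can be sketched in a few lines of Python. This is a minimal illustration with a flat (uniform) window, not the weighted Gaussian kernel used later for the joint proposals. The function name `mean_shift_flat`, the 2D point format, and the stopping tolerance are assumptions made for the example.

```python
import math


def mean_shift_flat(points, start, radius, tol=1e-6, max_iter=100):
    """Move a circular window toward the densest region of `points`.

    points: list of (x, y) tuples (the "billiard balls")
    start:  initial center of the region of interest
    radius: radius of the region of interest
    """
    center = start
    for _ in range(max_iter):
        # Balls that fall inside the current region of interest.
        inside = [p for p in points if math.dist(p, center) <= radius]
        if not inside:
            break  # empty window: there is nothing to move toward
        # Center of mass of the balls inside the window.
        mass = tuple(sum(coord) / len(inside) for coord in zip(*inside))
        # The mean shift vector is (mass - center); stop once it is tiny.
        if math.dist(mass, center) < tol:
            return mass
        center = mass
    return center


# A tight cluster near (5, 5) plus two far-away points.
balls = [(5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 5.1), (5.0, 4.9),
         (0.0, 0.0), (10.0, 0.0)]
mode = mean_shift_flat(balls, start=(4.2, 4.2), radius=1.5)
```

Each iteration repeats the steps in the figure: take the balls inside the region of interest, compute their center of mass, and move the window there until the shift becomes negligible.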
Main contribution
• Treat pose estimation as object recognition
– using a novel intermediate body parts representation
– designed to spatially localize joints
– at low computational cost and high accuracy
Experiments
• (i) synthetic depth training data is an excellent proxy for real data
• (ii) scaling up the learning problem with varied synthetic data is important for high accuracy
• (iii) our parts-based approach generalizes better than even an oracular exact nearest neighbor
Data
• Depth imaging and motion capture data
• Pose estimation research
– has often focused on techniques
– to overcome a lack of training data
• Two problems for depth images
– color
– pose
• Use real mocap data
– retargeted to a variety of base character models
– to synthesize a large, varied dataset
– 640x480 images at 30 frames per second
Depth image
• Advantages of depth cameras over traditional intensity sensors
– working in low light levels
– giving a calibrated scale estimate
– resolving silhouette ambiguities in pose
Motion capture data
• Capture a large database of motion capture (mocap) of human actions
– approximately 500k frames
– (driving, dancing, kicking, running, navigating menus)
• Need not record mocap with variation in
– rotation about the vertical axis, mirroring left-right, scene position, body shape and size, camera pose
– all of which can be added in (semi-)automatically
Motion capture data
• The classifier uses no temporal information
– static poses
– not motion
• Changes in pose from one mocap frame to the next are so small as to be insignificant
– similar poses are discarded using a ‘furthest neighbor’ clustering algorithm
– where the distance between poses p_1 and p_2 is max_j ||p_1^j − p_2^j||
– j indexes body joints, p_i denotes pose i
– after this, no two poses are closer than 5 cm
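A minimal sketch of the pose subsampling step, assuming the pose distance is the maximum Euclidean distance over corresponding joints and approximating the clustering with greedy farthest-point selection. The slides name only a ‘furthest neighbor’ clustering algorithm, so the selection loop, the function names, and the pose format (a list of 3D joint positions in meters) are illustrative assumptions.

```python
import math


def pose_distance(p1, p2):
    """Maximum Euclidean distance over corresponding joints (meters)."""
    return max(math.dist(a, b) for a, b in zip(p1, p2))


def subsample_poses(poses, min_dist=0.05):
    """Greedy farthest-point selection (an assumed stand-in for the
    clustering step). Returns indices of kept poses; no two kept poses
    are closer than `min_dist`."""
    if min_dist <= 0:
        raise ValueError("min_dist must be positive")
    if not poses:
        return []
    kept = [0]
    # nearest[i] = distance from pose i to its closest kept pose.
    nearest = [pose_distance(p, poses[0]) for p in poses]
    while True:
        far = max(range(len(poses)), key=nearest.__getitem__)
        if nearest[far] < min_dist:
            return kept  # every remaining pose is redundant
        kept.append(far)
        for i, p in enumerate(poses):
            nearest[i] = min(nearest[i], pose_distance(p, poses[far]))


# Three two-joint poses: b is 1 cm from a, c is 20 cm from a.
a = [(0.00, 0.0, 0.0), (0.0, 1.0, 0.0)]
b = [(0.01, 0.0, 0.0), (0.0, 1.0, 0.0)]
c = [(0.20, 0.0, 0.0), (0.0, 1.0, 0.0)]
kept = subsample_poses([a, b, c])
```

Pose `b` is dropped because it lies within 5 cm of a pose that is already kept, which is how near-duplicate consecutive mocap frames would be removed.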
Motion capture data
• Necessary to iterate the process of
– motion capture
– sampling from our model
– training the classifier
– testing joint prediction accuracy
• CMU mocap database
Generating synthetic data
• Build a randomized rendering pipeline
– sample fully labeled training images
• Goals
– realism and variety
Generating synthetic data
• First: randomly sample a set of parameters
• Then use standard computer graphics techniques
– render depth and body part images
– from texture-mapped 3D meshes
• Use Autodesk MotionBuilder
– slight random variation in height and weight gives extra coverage of body shapes
– other parameters
Body Part Inference and Joint Proposals
• Body part labeling
• Depth image features
• Randomized decision forests
• Joint position proposals
Body part labeling
• Intermediate body part representation
– color-coded
– some parts directly localize particular skeletal joints
– others fill the gaps
• Transforms the problem into one that can readily be solved by efficient classification algorithms
Body part labeling
• The parts are specified in a texture map
Body part labeling
• 31 body parts:
– LU/RU/LW/RW head, neck,
– L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand,
– LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee,
– L/R ankle, L/R foot (Left, Right, Upper, loWer)
Depth image features
• d_I(x) is the depth at pixel x in image I
• θ = (u, v) describes offsets u and v
• Normalizing the offsets by 1/d_I(x) ensures the features are depth invariant
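To make the definitions concrete, the sketch below assumes the feature has the form f_θ(I, x) = d_I(x + u/d_I(x)) − d_I(x + v/d_I(x)), that is, the difference of two depth probes whose offsets are scaled by 1/d_I(x). The nested-list image format, the rounding of offsets to whole pixels, and the constant `LARGE_DEPTH` used for probes that fall outside the image are assumptions for illustration.

```python
LARGE_DEPTH = 1e6  # assumed stand-in value for out-of-bounds probes


def probe(depth, row, col):
    """Depth at (row, col), or LARGE_DEPTH when outside the image."""
    if 0 <= row < len(depth) and 0 <= col < len(depth[0]):
        return depth[row][col]
    return LARGE_DEPTH


def depth_feature(depth, x, u, v):
    """Depth comparison feature at pixel x = (row, col).

    u and v are (row, col) offsets; dividing them by the depth at x
    shrinks the offsets (in pixels) for bodies farther from the camera.
    """
    d = depth[x[0]][x[1]]

    def offset(o):
        return (x[0] + round(o[0] / d), x[1] + round(o[1] / d))

    return probe(depth, *offset(u)) - probe(depth, *offset(v))


# 5x5 image at 2 m everywhere except one pixel at 3 m.
image = [[2.0] * 5 for _ in range(5)]
image[4][2] = 3.0
response = depth_feature(image, (2, 2), (4.0, 0.0), (0.0, -4.0))
off_image = depth_feature(image, (2, 2), (40.0, 0.0), (0.0, 0.0))
```

Scaling by 1/d_I(x) means a given offset covers fewer pixels when the body is farther away, so the probes land on roughly the same body-relative positions at any depth.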
Depth image features
• Individually these features provide only a weak signal
• In combination in a decision forest they are sufficient to accurately disambiguate all trained parts
Depth image features
• The design of these features was strongly motivated by their computational efficiency
– no preprocessing is needed
– each feature reads at most 3 image pixels
– and performs at most 5 arithmetic operations
– straightforwardly implemented on the GPU
Randomized decision forests
• Randomized decision forests
– fast and effective multi-class classifiers
– implemented efficiently on the GPU
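A minimal sketch of how a forest could classify one pixel, assuming each split node compares a feature response to a threshold and each leaf stores a distribution over body part labels, with the per-tree distributions averaged. The dictionary-based tree layout, the key names, the branch direction, and the function name `classify_pixel` are hypothetical choices for the example.

```python
def classify_pixel(forest, feature_fn):
    """Average the leaf distributions reached in each tree.

    forest:     list of trees. A split node is a dict with keys
                'theta', 'tau', 'left', 'right'; a leaf is a dict with
                key 'dist' mapping body part label -> probability.
    feature_fn: callable returning the feature response for a given
                theta at the pixel being classified.
    """
    total = {}
    for tree in forest:
        node = tree
        while 'dist' not in node:
            # Branch on the comparison of the feature to the threshold
            # (which side is "left" is an assumption of this sketch).
            if feature_fn(node['theta']) < node['tau']:
                node = node['left']
            else:
                node = node['right']
        for label, p in node['dist'].items():
            total[label] = total.get(label, 0.0) + p
    return {label: p / len(forest) for label, p in total.items()}


tree_1 = {'theta': 'a', 'tau': 0.5,
          'left': {'dist': {'hand': 1.0}},
          'right': {'dist': {'head': 1.0}}}
tree_2 = {'dist': {'hand': 0.5, 'head': 0.5}}
posterior = classify_pixel([tree_1, tree_2], lambda theta: 0.2)
```

Because every pixel walks the trees independently, the same loop can be run for all pixels in parallel, which is what makes a GPU implementation natural.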
Joint position proposals
• Generate reliable proposals for the positions of 3D skeletal joints
– the final output of our algorithm
– used by a tracking algorithm to self-initialize
– and recover from failure
Joint position proposals
• A local mode-finding approach based on mean shift with a weighted Gaussian kernel
– x̂_i is the reprojection of image pixel x_i
– into world space given depth d_I(x_i)
– b_c is a learned per-part bandwidth
Non-Parametric Density Estimation
Assumption: The data points are sampled from an underlying PDF
Data point density implies PDF value!
Figure labels: assumed underlying PDF, real data samples
Parametric Density Estimation
Assumption: The data points are sampled from an underlying PDF
PDF(x) = Σ_i c_i · exp(−(x − μ_i)² / (2σ_i²))
Estimate the parameters from the real data samples
Figure labels: assumed underlying PDF, real data samples
Joint position proposals
• The pixel weighting w_ic considers both the inferred body part probability at the pixel and the world surface area of the pixel
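A minimal sketch of the mode-finding step, assuming the density for part c is proportional to Σ_i w_ic · exp(−||(x̂ − x̂_i)/b_c||²) and the weight is w_ic = P(c | I, x_i) · d_I(x_i)², where the squared depth stands in for the world surface area of the pixel. The function names, the tuple-based 3D points, and the stopping rule are assumptions for illustration.

```python
import math


def pixel_weight(part_prob, depth):
    """w_ic: inferred part probability times squared depth."""
    return part_prob * depth ** 2


def mean_shift_mode(points, weights, start, bandwidth,
                    tol=1e-6, max_iter=200):
    """Weighted Gaussian mean shift from `start` to a local mode.

    points:  reprojected pixels in world space, as (x, y, z) tuples
    weights: one w_ic per point
    """
    x = start
    for _ in range(max_iter):
        # Kernel response of every point at the current estimate.
        k = [w * math.exp(-(math.dist(x, p) / bandwidth) ** 2)
             for p, w in zip(points, weights)]
        total = sum(k)
        if total == 0.0:
            break  # all points are too far away to contribute
        # Move to the kernel-weighted mean of the points.
        new_x = tuple(sum(ki * p[axis] for ki, p in zip(k, points)) / total
                      for axis in range(len(x)))
        if math.dist(new_x, x) < tol:
            return new_x
        x = new_x
    return x


# Four surface points arranged symmetrically around (0, 0, 2).
pts = [(-0.01, 0.0, 2.0), (0.01, 0.0, 2.0),
       (0.0, -0.01, 2.0), (0.0, 0.01, 2.0)]
w = [pixel_weight(0.5, 2.0)] * 4
mode = mean_shift_mode(pts, w, start=(0.02, 0.0, 2.0), bandwidth=0.065)
```

Weighting by squared depth compensates for the fact that a pixel on a distant body covers more surface area than a pixel on a nearby one.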
Joint position proposals
• The detected modes
– lie on the surface of the body
– are pushed back into the scene by a learned z offset to produce a final joint position proposal
• Bandwidth b_c = 0.065 m
• Threshold λ_c = 0.14
• z offset = 0.039 m
• Parameters set by grid search on a set of 5000 images
Experiments
• Training parameters (further results are provided in the supplementary material)
– 3 trees, 20 deep, 300k training images per tree
– 2000 training example pixels per image
– 2000 candidate features θ
– 50 candidate thresholds τ per feature
Experiments
• Test data
– challenging synthetic and real depth images to evaluate our approach
– synthesize 5000 depth images
• Real test set
– 8808 frames of real depth images
– 15 different subjects
– 7 upper body joint positions
Experiments
• Error metrics: quantify both classification and joint prediction accuracy
– Classification accuracy
• average of the diagonal of the confusion matrix
• between the ground truth part label and the most likely inferred part label
– Joint prediction accuracy
• generate recall-precision curves as a function of confidence threshold
• quantify accuracy as average precision per joint
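The classification metric can be illustrated with a short sketch, assuming the confusion matrix is normalized per ground truth class so that its diagonal holds per-class accuracies. The function name and the label lists are made up for the example.

```python
def average_per_class_accuracy(true_labels, predicted_labels):
    """Mean of the diagonal of the row-normalized confusion matrix."""
    classes = sorted(set(true_labels))
    correct = {c: 0 for c in classes}
    count = {c: 0 for c in classes}
    for truth, predicted in zip(true_labels, predicted_labels):
        count[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    # Each class contributes equally, however many pixels it has.
    return sum(correct[c] / count[c] for c in classes) / len(classes)


truth = ['hand', 'hand', 'hand', 'head']
inferred = ['hand', 'hand', 'head', 'head']
score = average_per_class_accuracy(truth, inferred)
```

Here the per-class accuracies are 2/3 for `hand` and 1 for `head`, so the score is 5/6, while plain per-pixel accuracy would be 3/4. Averaging per class keeps large body parts from dominating small ones.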
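Experiments
• Error metric:
– only the first joint proposal within D meters of the ground truth counts as a true positive; other proposals within D count as false positives
– this penalizes multiple spurious detections near the correct position, which might slow a downstream tracking algorithm
• D = 0.1 m, close to the accuracy of the real test data ground truth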
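A minimal sketch of how proposals for one joint might be labeled before computing precision and recall, assuming that only the first proposal within D of the ground truth is a true positive and everything else is a false positive. Visiting proposals in order of decreasing confidence, the function name, and the `(confidence, position)` format are assumptions of the sketch.

```python
import math


def label_proposals(proposals, ground_truth, D=0.1):
    """Label proposals for a single joint.

    proposals:    list of (confidence, (x, y, z)) tuples
    ground_truth: (x, y, z) position of the joint, in meters
    Returns (confidence, is_true_positive) pairs, most confident first.
    """
    matched = False
    labeled = []
    for conf, pos in sorted(proposals, key=lambda p: p[0], reverse=True):
        # Only the first proposal within D of the ground truth matches.
        hit = (not matched) and math.dist(pos, ground_truth) <= D
        matched = matched or hit
        labeled.append((conf, hit))
    return labeled


joint = (0.0, 0.0, 2.0)
candidates = [(0.8, (0.05, 0.0, 2.0)),   # near, but less confident
              (0.9, (0.02, 0.0, 2.0)),   # near and most confident
              (0.3, (0.50, 0.0, 2.0))]   # far from the joint
labels = label_proposals(candidates, joint)
```

The second nearby proposal is counted as a false positive even though it is within D, which is how duplicate detections around the correct position are penalized.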
Experiments
• Real time motion capture using a single time-of-flight camera [CVPR 2010]
Discussion
• Accurate proposals
– for the 3D locations of body joints
– super real-time from single depth images
• Body part recognition
– as an intermediate representation
• A highly varied synthetic training set
– train very deep decision forests
– using depth-invariant features without overfitting
Future work
• Study of the variability in the source mocap data
• Generative model underlying the synthesis pipeline
• A similarly efficient approach
– directly regress joint positions
– remove ambiguities in local pose
Thank you