Real-Time Human Pose Recognition in Parts from Single Depth Images

Uploaded by spongemint (Software & Software Engineering), 2 Dec 2013


Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake

Microsoft Research Cambridge & Xbox Incubation

CVPR 2011 Best Paper

OUTLINE


- Introduction
- Data
- Body Part Inference and Joint Proposals
- Experiments
- Discussion

Introduction


- Robust interactive human body tracking
  - gaming, human-computer interaction, security, telepresence, health care
- Real-time depth cameras
  - existing systems track from frame to frame but struggle to re-initialize quickly, and so are not robust
- Our focus is per-frame initialization, to complement a tracking algorithm
  - focus on pose recognition in parts
  - 3D position candidates for each skeletal joint

Introduction


- The output can be used by any appropriate tracking algorithm, for example:
  - Tracking people with twists and exponential maps (CVPR 1998)
  - Tracking loose-limbed people (CVPR 2004)
  - Nonlinear body pose estimation from depth images (DAGM 2005)
  - Real-time hand-tracking with a color glove (ACM 2009)
  - Real time motion capture using a single time-of-flight camera (CVPR 2010)


Introduction


- Inspired by recent object recognition work that divides objects into parts
  - Object class recognition by unsupervised scale-invariant learning [CVPR 2003]
  - The layout consistent random field for recognizing and segmenting partially occluded objects [CVPR 2006]
- Two key design goals
  - Computational efficiency
  - Robustness

Introduction

Depth image -> (segment) -> dense probabilistic body part labeling -> (generate) -> 3D joint proposals

- The body parts are defined to be spatially localized near skeletal joints

Introduction


- We treat the segmentation into body parts as a per-pixel classification task
  - Evaluating each pixel separately
- Training data
  - generate realistic synthetic depth images
  - train a deep randomized decision forest classifier
  - avoid overfitting





Introduction


Overfitting









- Simple, discriminative depth comparison image features
  - maintaining high computational efficiency



Introduction


- For further speed, the classifier can be run in parallel on each pixel on a GPU
- Spatial modes of the inferred per-pixel distributions are computed using mean shift, resulting in the 3D joint proposals


What is Mean Shift?

- Non-parametric density estimation -> non-parametric density gradient estimation (mean shift)
- Data -> discrete PDF representation -> PDF analysis
- PDF in feature space
  - Color space
  - Scale space
  - Actually any feature space you can conceive
- A tool for finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N

Intuitive Description

- Distribution of identical billiard balls
- Figure labels: region of interest, center of mass, mean shift vector
- Objective: find the densest region
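The billiard-ball picture can be sketched in a few lines. This is a minimal illustration with a flat (uniform) window, not the weighted Gaussian variant used for joint proposals; the function name `mean_shift_mode` and the parameters `radius`, `tol` and `max_iter` are assumptions made for this example.

```python
import math

def mean_shift_mode(points, start, radius, tol=1e-6, max_iter=100):
    """Slide a window of the given radius until it settles on a dense region."""
    center = list(start)
    for _ in range(max_iter):
        # Region of interest: all samples within `radius` of the current center.
        inside = [p for p in points if math.dist(p, center) <= radius]
        if not inside:
            break
        # Center of mass of the samples inside the window.
        mass = [sum(coords) / len(inside) for coords in zip(*inside)]
        # Length of the mean shift vector (current center -> center of mass).
        shift = math.dist(mass, center)
        center = mass
        if shift < tol:
            break
    return center

# Four balls packed near the origin and one far-away outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
mode = mean_shift_mode(points, start=(0.5, 0.5), radius=1.0)
```

Starting at (0.5, 0.5), the window only ever contains the four nearby balls, so it settles on their center of mass and ignores the outlier.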



Main contribution

- Treat pose estimation as object recognition
  - using a novel intermediate body parts representation
  - designed to spatially localize joints
  - at low computational cost and high accuracy

Experiments

- (i) synthetic depth training data is an excellent proxy for real data
- (ii) scaling up the learning problem with varied synthetic data is important for high accuracy
- (iii) our parts-based approach generalizes better than even an oracular exact nearest neighbor

Data


- Depth imaging and motion capture data
- Pose estimation research
  - has often focused on techniques to overcome the lack of training data
- Two problems in synthesizing realistic training images
  - variability in color
  - variety of poses
- Use real mocap data
  - retargeted to a variety of base character models
  - to synthesize a large, varied dataset


Depth image

- 640x480 image at 30 frames per second
- Advantages of depth cameras over traditional intensity sensors
  - working in low light levels
  - giving a calibrated scale estimate
  - resolving silhouette ambiguities in pose


Motion capture data

- Capture a large database of motion capture (mocap) of human actions
  - approximately 500k frames
  - (driving, dancing, kicking, running, navigating menus)
- Need not record mocap with variation in
  - rotation about the vertical axis, mirroring left-right, scene position
  - body shape and size, camera pose
  - all of which can be added in (semi-)automatically


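Motion capture data

- The classifier uses no temporal information
  - static poses, not motion
- Changes in pose from one frame to the next are so small as to be insignificant
- Discard similar, redundant poses using a 'furthest neighbor' clustering algorithm
  - where the distance between poses $p_1$ and $p_2$ is $\max_j \lVert p_1^j - p_2^j \rVert_2$
  - $j$ ranges over the body joints, $p_i$ denotes pose $i$
  - keep a subset in which no two poses are closer than 5 cm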
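The pose-thinning step can be sketched as a greedy furthest-point selection. This is one plausible reading of 'furthest neighbor' clustering, not the authors' code; `pose_distance`, `subsample_poses` and the choice of the first pose as seed are assumptions for illustration, and poses are given as lists of 3D joint positions in meters.

```python
import math

def pose_distance(p1, p2):
    """Maximum Euclidean distance over corresponding joints (meters)."""
    return max(math.dist(a, b) for a, b in zip(p1, p2))

def subsample_poses(poses, min_dist=0.05):
    """Greedily keep the pose farthest from the kept set until all are within min_dist."""
    kept = [0]  # assumed seed: the first pose in the list
    # Distance from every pose to its nearest kept pose.
    nearest = [pose_distance(p, poses[0]) for p in poses]
    while True:
        idx = max(range(len(poses)), key=nearest.__getitem__)
        if nearest[idx] < min_dist:
            break  # every remaining pose is closer than min_dist to a kept pose
        kept.append(idx)
        nearest = [min(d, pose_distance(p, poses[idx]))
                   for d, p in zip(nearest, poses)]
    return kept

# Three two-joint poses: pose 1 is 1 cm from pose 0, pose 2 is 50 cm away.
poses = [
    [(0.00, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.01, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.50, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
kept = subsample_poses(poses)
```

Pose 1 is discarded as redundant because it lies within 5 cm of pose 0, while pose 2 is kept.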


Motion capture data

- Necessary to iterate the process of
  - motion capture
  - sampling from our model
  - training the classifier
  - testing joint prediction accuracy
- Early experiments used the CMU mocap database


Generating synthetic data

- Build a randomized rendering pipeline
  - sample fully labeled training images
- Goals
  - realism and variety


- First: randomly sample a set of parameters
- Then use standard computer graphics techniques
  - render depth and body part images
  - from texture mapped 3D meshes
- Use Autodesk MotionBuilder
- Slight random variation in height and weight gives extra coverage of body shapes
- Other parameters

Body Part Inference and Joint Proposals

- Body part labeling
- Depth image features
- Randomized decision forests
- Joint position proposals

Body part labeling


- Intermediate body part representation
  - shown color-coded
  - some parts directly localize particular skeletal joints
  - others fill the gaps
- Transforms the problem into one that can readily be solved by efficient classification algorithms

Body part labeling


- The parts are specified in a texture map






Body part labeling


- 31 body parts: LU/RU/LW/RW head, neck, L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot
  - (Left, Right, Upper, loWer)



Depth image features


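- $f_\theta(I, x) = d_I\left(x + \frac{u}{d_I(x)}\right) - d_I\left(x + \frac{v}{d_I(x)}\right)$
- $d_I(x)$ is the depth at pixel $x$ in image $I$
- $\theta = (u, v)$ describes offsets $u$ and $v$
- Normalizing the offsets by $1/d_I(x)$ ensures the features are depth invariant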
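A depth comparison feature of this kind can be sketched as follows. This is a minimal illustration, assuming a depth image stored as a list of rows in meters; the helper names, the rounding of probe positions to the nearest pixel, and the large constant `BACKGROUND` returned for probes that fall off the image are assumptions for this example.

```python
BACKGROUND = 1e6  # assumed large depth for probes that fall outside the image

def depth_at(image, x, y):
    """Depth at pixel (x, y), or BACKGROUND when the probe is out of bounds."""
    if 0 <= y < len(image) and 0 <= x < len(image[0]):
        return image[y][x]
    return BACKGROUND

def depth_feature(image, x, y, u, v):
    """Difference of two depth probes whose offsets are scaled by 1 / depth."""
    d = depth_at(image, x, y)
    # Dividing by d shrinks the pixel offset for far-away bodies, so the
    # probes cover the same world-space distance at any depth.
    ux, uy = x + round(u[0] / d), y + round(u[1] / d)
    vx, vy = x + round(v[0] / d), y + round(v[1] / d)
    return depth_at(image, ux, uy) - depth_at(image, vx, vy)

# A flat 5x5 surface at 2 m with one pixel further away at 3 m.
image = [[2.0] * 5 for _ in range(5)]
image[2][4] = 3.0
response = depth_feature(image, 2, 2, u=(4.0, 0.0), v=(0.0, 0.0))
off_image = depth_feature(image, 2, 2, u=(40.0, 0.0), v=(0.0, 0.0))
```

The function reads three pixels and performs a handful of arithmetic operations, which is what makes features of this shape cheap to evaluate per pixel.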


Depth image features


- Individually these features provide only a weak signal
- In combination in a decision forest they are sufficient to accurately disambiguate all trained parts


Depth image features


- The design of these features was strongly motivated by their computational efficiency
  - no preprocessing is needed
  - each feature reads at most 3 image pixels
  - and performs at most 5 arithmetic operations
  - straightforwardly implemented on the GPU

Randomized decision forests


- Fast and effective multi-class classifiers
- Implemented efficiently on the GPU
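How a forest classifies one pixel can be sketched as below. This is a generic decision forest walk, not the authors' GPU implementation: the nested-dictionary tree layout, the key names `theta`, `tau`, `left`, `right` and `leaf`, and the convention that values below the threshold go left are all assumptions for illustration.

```python
def classify_pixel(forest, feature_fn):
    """Average the class distributions stored at the leaf reached in each tree."""
    total = None
    for tree in forest:
        node = tree
        # Descend until a leaf: compare the feature response with the threshold.
        while "leaf" not in node:
            value = feature_fn(node["theta"])
            node = node["left"] if value < node["tau"] else node["right"]
        dist = node["leaf"]
        total = list(dist) if total is None else [a + b for a, b in zip(total, dist)]
    return [p / len(forest) for p in total]

# Two single-split trees over two hypothetical features "a" and "b".
forest = [
    {"theta": "a", "tau": 0.5,
     "left": {"leaf": [0.9, 0.1]}, "right": {"leaf": [0.2, 0.8]}},
    {"theta": "b", "tau": 0.0,
     "left": {"leaf": [0.6, 0.4]}, "right": {"leaf": [0.1, 0.9]}},
]
responses = {"a": 0.3, "b": 1.0}  # made-up feature responses for one pixel
posterior = classify_pixel(forest, lambda theta: responses[theta])
```

Each pixel is handled independently of its neighbors, which is why this kind of classifier parallelizes well across pixels.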

Joint position proposals


- Generate reliable proposals for the positions of 3D skeletal joints
  - the final output of our algorithm
  - used by a tracking algorithm to self-initialize and recover from failure


Joint position proposals


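- A local mode-finding approach based on mean shift with a weighted Gaussian kernel
- Density estimator per body part: $f_c(\hat{x}) \propto \sum_{i=1}^{N} w_{ic} \exp\left(-\left\lVert \frac{\hat{x} - \hat{x}_i}{b_c} \right\rVert^2\right)$
- $\hat{x}_i$ is the reprojection of image pixel $x_i$ into world space given depth $d_I(x_i)$
- $b_c$ is a learned per-part bandwidth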



Non-Parametric Density Estimation

- Assumption: the data points are sampled from an underlying PDF
- Data point density implies PDF value
- Figure labels: assumed underlying PDF, real data samples

Parametric Density Estimation

- Assumption: the data points are sampled from an underlying PDF
- Assumed underlying PDF: $\mathrm{PDF}(x) = \sum_i c_i \, e^{-\frac{(x - \mu_i)^2}{2\sigma_i^2}}$
- Estimate the parameters from the real data samples

Joint position proposals


- The pixel weighting $w_{ic}$ considers both the inferred body part probability at the pixel and the world surface area of the pixel
  - $w_{ic} = P(c \mid I, x_i) \cdot d_I(x_i)^2$
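Mode finding with a weighted Gaussian kernel can be sketched as below. This is a minimal illustration, assuming the pixels have already been reprojected to 3D world points and that each pixel weight is the part probability times the squared depth; `weighted_mean_shift`, `tol` and `max_iter` are names chosen for this example, and the 0.065 m bandwidth is used only as an indicative value.

```python
import math

def weighted_mean_shift(points, weights, start, bandwidth, tol=1e-6, max_iter=100):
    """Climb to a local mode of a weighted Gaussian kernel density estimate."""
    x = list(start)
    for _ in range(max_iter):
        # Kernel response of every point: w_i * exp(-||(x - x_i) / b||^2).
        k = [w * math.exp(-(math.dist(x, p) / bandwidth) ** 2)
             for p, w in zip(points, weights)]
        total = sum(k)
        if total == 0.0:
            break
        # New estimate: kernel-weighted mean of the points.
        new_x = [sum(ki * p[d] for ki, p in zip(k, points)) / total
                 for d in range(len(x))]
        moved = math.dist(new_x, x)
        x = new_x
        if moved < tol:
            break
    return x

# Three confident pixels around x = 0 and one weak pixel 1 m away, all at 2 m depth.
points = [(0.0, 0.0, 2.0), (0.02, 0.0, 2.0), (-0.02, 0.0, 2.0), (1.0, 0.0, 2.0)]
probs = [0.9, 0.9, 0.9, 0.1]
weights = [p * pt[2] ** 2 for p, pt in zip(probs, points)]
mode = weighted_mean_shift(points, weights, start=(0.1, 0.0, 2.0), bandwidth=0.065)
```

The distant low-probability pixel contributes almost nothing at this bandwidth, so the mode settles in the middle of the confident cluster.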

Joint position proposals


- The detected modes lie on the surface of the body
  - each is pushed back into the scene by a learned z offset to produce a final joint position proposal
- Parameters set by grid search on a hold-out set of 5000 images
  - mean bandwidth $b_c$ = 0.065 m
  - probability threshold $\lambda_c$ = 0.14
  - z offset = 0.039 m

Experiments


- Further results are provided in the supplementary material
- Default training parameters
  - 3 trees, 20 deep, 300k training images per tree
  - 2000 training example pixels per image
  - 2000 candidate features $\theta$
  - 50 candidate thresholds $\tau$ per feature
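Choosing among candidate features and thresholds at a tree node can be sketched as follows. This assumes the common information gain criterion with Shannon entropy, which the slides do not spell out; `entropy`, `best_split` and the sample layout (a dictionary of precomputed feature responses plus a part label) are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(samples, candidates):
    """Pick the (theta, tau) candidate with the largest information gain.

    samples: list of (responses, label), where responses maps theta -> value.
    candidates: list of (theta, tau) pairs to try.
    """
    labels = [label for _, label in samples]
    base = entropy(labels)
    best, best_gain = None, 0.0
    for theta, tau in candidates:
        left = [label for resp, label in samples if resp[theta] < tau]
        right = [label for resp, label in samples if resp[theta] >= tau]
        if not left or not right:
            continue  # a split that sends everything one way is useless
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best_gain:
            best, best_gain = (theta, tau), gain
    return best, best_gain

samples = [
    ({"a": 0.1, "b": 0.9}, "hand"),
    ({"a": 0.2, "b": 0.1}, "hand"),
    ({"a": 0.8, "b": 0.8}, "head"),
    ({"a": 0.9, "b": 0.2}, "head"),
]
split, gain = best_split(samples, [("a", 0.5), ("b", 0.5)])
```

Feature "a" separates the two parts perfectly, so it wins with a gain of one bit, while feature "b" gives no gain at all.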


Experiments


- Test data
  - challenging synthetic and real depth images to evaluate our approach
  - synthesize 5000 depth images
- Real test set
  - 8808 frames of real depth images
  - 15 different subjects
  - 7 upper body joint positions

Experiments


- Error metric: quantify both classification and joint prediction accuracy
- Classification accuracy
  - average of the diagonal of the confusion matrix between the ground truth part label and the most likely inferred part label
- Joint prediction accuracy
  - generate recall-precision curves as a function of confidence threshold
  - quantify accuracy as average precision per joint
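The classification metric can be sketched as the mean of the per-class recalls, which is the average of the diagonal of a row-normalized confusion matrix. The function name and the toy labels below are assumptions for illustration.

```python
def average_per_class_accuracy(true_labels, predicted_labels):
    """Mean over classes of the fraction of that class's pixels labeled correctly."""
    classes = sorted(set(true_labels))
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(true_labels) if t == c]
        correct = sum(1 for i in idx if predicted_labels[i] == c)
        per_class.append(correct / len(idx))  # diagonal entry for class c
    return sum(per_class) / len(per_class)

# Four head pixels labeled correctly, one hand pixel mislabeled as torso.
true_labels = ["head", "head", "head", "head", "hand"]
predicted = ["head", "head", "head", "head", "torso"]
score = average_per_class_accuracy(true_labels, predicted)
```

Plain pixel accuracy would be 0.8 here, but the per-class average is 0.5, so small parts such as hands count as much as large ones.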



Experiments


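- Error metric:
  - the first joint proposal within D meters of the ground truth counts as a true positive; other proposals within D meters count as false positives
  - this penalizes multiple spurious detections near the correct position, which might slow a downstream tracking algorithm
  - D = 0.1 m, approximately the accuracy of the hand-labeled real test data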
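The true positive rule can be sketched for one joint in one frame as below. Visiting proposals in order of decreasing confidence is an assumption about what "first" means, and `label_proposals` and the toy coordinates are made up for illustration.

```python
import math

def label_proposals(proposals, ground_truth, max_dist=0.1):
    """Mark each proposal as a true positive (True) or false positive (False).

    proposals: list of (confidence, (x, y, z)) for one joint in one frame.
    """
    labeled, matched = [], False
    for conf, pos in sorted(proposals, key=lambda p: p[0], reverse=True):
        # Only the first proposal within max_dist of the ground truth counts.
        hit = (not matched) and math.dist(pos, ground_truth) <= max_dist
        matched = matched or hit
        labeled.append((conf, hit))
    return labeled

ground_truth = (0.0, 0.0, 2.0)
proposals = [
    (0.6, (0.05, 0.0, 2.0)),  # near, but lower confidence than the first hit
    (0.9, (0.02, 0.0, 2.0)),  # near and most confident
    (0.3, (0.50, 0.0, 2.0)),  # far from the joint
]
labeled = label_proposals(proposals, ground_truth)
```

The second nearby proposal is counted as a false positive even though it is within 0.1 m, which is how duplicate detections get penalized.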

Experiments

- Real time motion capture using a single time-of-flight camera [CVPR 2010]

Discussion


- Accurate proposals for the 3D locations of body joints
  - super real-time from single depth images
- Body part recognition as an intermediate representation
- A highly varied synthetic training set
  - allows training very deep decision forests
  - with depth invariant features, without overfitting


Future work


- Study of the variability in the source mocap data
- Generative model underlying the synthesis pipeline
- A similarly efficient approach that directly regresses joint positions
- Remove ambiguities in local pose


Thank you