An Application for Mixing Reality with Virtuality




Ko Wing Hong (48


The Kinect, released initially by Microsoft as a game control device, has shown much promise for
the development of virtual reality, because users can interact with a system using nothing but
their own bodies. While most of the attention on the Kinect has been on its role as a game
controller, the Kinect also finds uses in computer graphics and vision applications, thanks to the
inclusion of not just a colour camera, but also a depth camera and a skeleton recognition engine.
The depth camera and skeleton recognition are particularly useful for academic uses as they ease
the process of human extraction and feature detection. With these tools in mind, I set out to
create an application that combines their inputs to generate a totally virtual environment.


In this application, the real image taken from the colour camera is given a “cel-shading” filter to
make it cartoonish, and then combined with a cel-shaded, user-controlled character. Other
capabilities of the Kinect are used to take care of the interaction between the character and
the environment.


1) Replacing Human with Virtual Characters

The goal of the application is to take the real-world image and transform it into a “virtual” image
that looks computer generated. This is motivated by recent developments in “cel-shading” filters
that allow a user to transform photographs into cartoonish images. Such techniques are seeing
wider use now that they are included as a basic effect in cameras. What I am after, though, is a
more involved experience, where the user can actually interact with the cel-shaded environment,
as if s/he is in a virtual world. To this end, however, cel-shading filters are usually
inadequate for humans.

Therefore, I chose to replace the whole human with a virtual character. This not only makes it
easier to blend into the scene, it also allows the user to become a character that only exists in
fantasies. Ideally, the application should also be able to erase the (real) human figure from the
colour image, though this is not effectively implemented in this iteration of the application.

2) Blending of Computer Graphics and Real Image

To blend the real image from the colour camera with the computer-generated character,
“cel-shading” filtering is applied to both components.

3) Interaction with Environment


To make the application an “experience”, it needs to allow interaction between the virtual
character and the environment. One example is the handling of occlusion, which should happen if
part of the real image is in front of the character. Ideally, the application should also handle
normal human actions like picking up objects, though this is not realised as of now due to
various limitations.

Previous Work

With the goal in mind, there are several projects that may be relevant to the implementation of
our application.

1) Make the Line Dance [1]

Make the Line Dance is a Kinect project where the team traced out the skeleton of a tracked
human and projected it back onto the human, who is dressed in black. The result is a human figure
made of lines, which moves when the human dances. While the team is not mapping the skeletal data
to a 3D model, their handling of the skeletal data, especially the smoothness of the movement,
should shed some light on how we should handle the data.

2) Making Miku Dance with Kinect [2]

This is a project that multiple parties in Japan have tried. It is effectively motion capture using
the Kinect, which is then applied to the Miku model when the Miku Dance program is run. This
project should point us in some direction on how to animate a 3D model with respect to input
from the Kinect. However, it appears that the capturing and playback are not really in real time,
which makes it less suitable as a starting point for an interactive application.

3) Kinect + XNA


A project by Arbuzz, this application takes input from the Kinect and applies the transformation
to a model in XNA, in real time. The movement is pretty smooth, and we can see that the model
blends in well with the environment. The only missing piece is the interaction with the
environment (the environment is just the background). The fact that this project was done by a
one-man team also gives me the confidence to go on with my idea.


Considering that my application is most similar to Arbuzz’s Kinect + XNA, I decided to take a
similar approach, which is to use the Kinect functionality in a game/graphics engine. Just like
this previous work, I rely heavily on the skeleton recognition, but I also add interactivity by
using the colour camera and depth camera.

1) Character Integration

The general flow here is to extract the skeleton data from the Kinect, and then map it to a “boned”
model in a graphics engine. When available, the position of each joint in the skeleton data is
returned in a list by the Kinect. These position data, however, cannot be used directly, because
a bone in a Kinect skeleton may not have the same length as the corresponding bone in the
graphical model. Using the global positions, then, would mean unwanted transformation of the
shape of the character model. Instead, we have to change the orientation of each bone in the
model. The joint orientation, unfortunately, is not provided by the Microsoft SDK, unlike
OpenNI, and has to be estimated. The estimation uses the positions of the two joints at the ends
of a bone to get a vector that goes along the Kinect bone. Then, a rotation quaternion is
generated that will orient the bone with respect to the world coordinates. The rotation is then
applied to the bone of the character. The result is pretty smooth movement for the model (at
least for the hands), even though the corresponding joints in the real human and the character do
not cover the exact same point in the world.
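
The orientation estimate described above can be sketched as follows. This is only an illustration
in Python, not the code of the application: the function names, the (w, x, y, z) quaternion
layout, and the assumption that a model bone points along +Y in its rest pose are all
hypothetical.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / length for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rotation_between(rest_dir, target_dir):
    """Unit quaternion (w, x, y, z) that rotates rest_dir onto target_dir."""
    a = normalize(rest_dir)
    b = normalize(target_dir)
    d = dot(a, b)
    if d < -0.999999:
        # Opposite directions: turn 180 degrees about any axis perpendicular to a.
        axis = cross(a, (1.0, 0.0, 0.0))
        if dot(axis, axis) < 1e-12:
            axis = cross(a, (0.0, 1.0, 0.0))
        axis = normalize(axis)
        return (0.0, axis[0], axis[1], axis[2])
    c = cross(a, b)
    q = (1.0 + d, c[0], c[1], c[2])
    n = math.sqrt(sum(x * x for x in q))
    return tuple(x / n for x in q)

def rotate(q, v):
    """Rotate vector v by the unit quaternion q."""
    w, u = q[0], q[1:]
    uv = cross(u, v)
    uuv = cross(u, uv)
    return tuple(v[i] + 2.0 * (w * uv[i] + uuv[i]) for i in range(3))

def bone_orientation(parent_joint, child_joint, rest_dir=(0.0, 1.0, 0.0)):
    """Estimate a bone's world orientation from the two joint positions at its ends.

    rest_dir is the (assumed) direction of the model bone before any rotation.
    """
    bone = tuple(c - p for p, c in zip(parent_joint, child_joint))
    return rotation_between(rest_dir, bone)
```

Because only the direction of the Kinect bone is used, the length of the model bone is left
untouched, which is the reason for working with orientations rather than positions. Note that a
single direction vector does not determine the twist of the bone about its own axis.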

In terms of rendering, the character is drawn using the Object-oriented Graphics Rendering
Engine (OGRE). The material for the character uses a fragment program and a vertex program
that implement a cel-shading technique. In short, it renders a bulged-out version of the model in
black first, and then covers it up with a material that has a stepped colour value.

We see that it blends pretty well with our treatment of the real image (to be explained below),
especially when there are more colours in the real image.
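
The “stepped colour value” can be illustrated on the CPU with a few lines of Python. This is a
sketch under assumptions, not the shader used in the application: the number of bands, the use of
a simple diffuse (Lambert) term, and the function names are hypothetical, and the black outline
pass is not shown.

```python
import math

def stepped_intensity(normal, light_dir, bands=3):
    """Quantise the diffuse lighting term into a small number of flat bands."""
    n_len = math.sqrt(sum(c * c for c in normal))
    l_len = math.sqrt(sum(c * c for c in light_dir))
    cos_angle = sum(n * l for n, l in zip(normal, light_dir)) / (n_len * l_len)
    diffuse = max(0.0, cos_angle)
    # Map [0, 1] onto band indices 0..bands-1, then onto evenly spaced levels.
    band = min(int(diffuse * bands), bands - 1)
    return (band + 1) / bands

def shade(base_colour, normal, light_dir, bands=3):
    """Scale an (r, g, b) base colour by the stepped intensity."""
    k = stepped_intensity(normal, light_dir, bands)
    return tuple(int(round(c * k)) for c in base_colour)
```

With three bands, a surface facing the light keeps its full base colour, while a surface at a
right angle to the light drops to one third of it, with a hard step in between instead of a
smooth gradient.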

2) Processing the Real Image (from Colour Camera)

To fit the 3D model, the image data from the colour camera of the Kinect is also processed to
give it a toonish feeling. The first step is to increase the contrast of the image, in line with
the exaggerating nature of cel shading. Then, the colours of the image are smoothed, using a
Gaussian filter provided in the OpenCV library. Each (8-bit) colour value is also processed so
that it is always a multiple of 64, to give the effect of colour patches.
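
The contrast and quantisation steps can be sketched in plain Python as below. The contrast gain
of 1.5 and the choice of rounding down to the nearest multiple of 64 are assumptions made for
illustration, the function names are hypothetical, and the Gaussian smoothing step is left out.

```python
def increase_contrast(value, gain=1.5):
    """Stretch an 8-bit value away from mid-grey (128) and clamp it to 0..255."""
    stretched = (value - 128) * gain + 128
    return max(0, min(255, int(round(stretched))))

def quantise(value, step=64):
    """Round an 8-bit value down to the nearest multiple of step."""
    return (value // step) * step

def posterise(image, gain=1.5, step=64):
    """Apply contrast and quantisation to an image given as rows of (r, g, b) tuples."""
    return [[tuple(quantise(increase_contrast(c, gain), step) for c in pixel)
             for pixel in row]
            for row in image]
```

With a step of 64 each channel can only take the values 0, 64, 128 and 192, so neighbouring
pixels with similar colours collapse into the same flat patch.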

The harder problem is finding the edges of the objects and painting them black. A common
method, which is also used here, is the Canny edge detector [5]. It works by first calculating
the pixel gradient of the image intensity using the Sobel operator. If a pixel is found to be
much different from its neighbours, i.e., its gradient is above a high threshold, the detector
searches around the pixel and follows the line for as long as the gradient stays above a second,
lower threshold; the line found this way is drawn. The method is conveniently provided by the
OpenCV library. The edge information returned is then added to the colour image, to give the
final cel-shading effect.
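
The gradient step and the painting of edges can be sketched in plain Python. This is not a full
Canny detector (there is no non-maximum suppression and no two-threshold line following), and it
is not the OpenCV routine used by the application; the function names and the single threshold
are assumptions for illustration.

```python
def sobel_magnitude(grey):
    """Gradient magnitude of a 2D list of intensities; border pixels are left at 0."""
    h, w = len(grey), len(grey[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal and vertical Sobel responses.
            gx = (grey[y - 1][x + 1] + 2 * grey[y][x + 1] + grey[y + 1][x + 1]
                  - grey[y - 1][x - 1] - 2 * grey[y][x - 1] - grey[y + 1][x - 1])
            gy = (grey[y + 1][x - 1] + 2 * grey[y + 1][x] + grey[y + 1][x + 1]
                  - grey[y - 1][x - 1] - 2 * grey[y - 1][x] - grey[y - 1][x + 1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

def paint_edges(colour, grey, threshold=200.0):
    """Return a copy of the colour image with strong-gradient pixels painted black."""
    mag = sobel_magnitude(grey)
    return [[(0, 0, 0) if mag[y][x] > threshold else colour[y][x]
             for x in range(len(row))]
            for y, row in enumerate(colour)]
```

Here `colour` is the quantised colour image as rows of (r, g, b) tuples and `grey` is the
matching intensity image of the same size.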

3) Interaction with Environment

This part is what distinguishes this application from previous attempts. While, ideally, the
application should also handle things like picking up objects, this time I chose to focus mainly
on handling “occlusion”. “Occlusion” refers to the case where some object appears in front of our
target of recognition, giving us incomplete information about the target. It is also a problem
because, in a reality mixed with virtual objects, virtual objects are usually rendered on top of
the whole real image. Hence, objects that are supposedly occluding our virtual model will instead
be rendered behind the model. To handle occlusion, therefore, means that we need to figure out
which part of the image is the foreground, cut it out, and render it in front of the virtual
model.

In this application, we do this by making use of the Kinect depth camera. The depth camera
gives us information on how far away every object is from the camera, and from it we can divide
the environment into 3 layers: objects that are wholly behind the human, objects at the same
depth as the human, and objects in front of the human.
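
The three-layer split can be written down as a small classifier. This is a sketch, assuming depth
values in millimetres and a known near/far depth range for the human; the constant and function
names are hypothetical.

```python
BEHIND, SAME_DEPTH, IN_FRONT = 0, 1, 2

def classify_depth(depth_mm, human_near_mm, human_far_mm):
    """Assign a depth sample to one of the three layers relative to the human."""
    if depth_mm > human_far_mm:
        return BEHIND
    if depth_mm < human_near_mm:
        # A depth of 0 (no valid reading) also ends up in this layer.
        return IN_FRONT
    return SAME_DEPTH
```

For example, with the human occupying the range 1800 mm to 2200 mm, a wall at 3000 mm is
classified as behind, and a pole at 1500 mm as in front.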

The handling for the first and last layers is trivial. The middle layer, however, needs more
processing. To do this, we make use of the Kinect depth camera’s passive player tracking
capability. As it turns out, the depth camera also tracks the player and “paints” the
corresponding pixels with a player id. In the middle layer, then, we just need to put pixels
with a player id in the background and the rest in the foreground. A limitation here is that
objects held by a player will also be painted with the same player id. This means that in our
implementation, any held object will be in the background and covered by our 3D model, even if it
is protruding towards the camera. We can probably handle this by analysing the pixels with a
player id (see Future Improvements), but this is not implemented in the current version.
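
Putting the layers and the player id together, the foreground mask could be built as in the
sketch below. It assumes per-pixel depth in millimetres, a per-pixel player id where 0 means “no
player”, and a known depth range for the human; the function name and parameters are
hypothetical.

```python
def foreground_mask(depth, player_ids, near_mm, far_mm):
    """Return True where the real pixel should be drawn in front of the virtual character."""
    mask = []
    for depth_row, id_row in zip(depth, player_ids):
        row = []
        for d, pid in zip(depth_row, id_row):
            if d > far_mm:
                row.append(False)      # behind the human: background
            elif d < near_mm:
                row.append(True)       # in front of the human: foreground
            else:
                row.append(pid == 0)   # same depth: player pixels go to the background
        mask.append(row)
    return mask
```

Pixels where the mask is True are copied from the real image on top of the rendered character;
everything else stays underneath it.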

Demonstration Video Rundown and Findings

The demo video showcases the current version of the application as well as some of its
limitations. In the first section (00:00 ~ 00:30), we see a basic test of the skeleton tracking.
The hand waving works pretty well, with the hands of the 3D character replicating my movement
closely. One slight problem is the legs, which tend to overreact. This may actually have
something to do with the tight setting of my room. Referring to the Skeletal Viewer provided with
the Microsoft SDK, it can be seen that sometimes part of the table is detected as part of my
thigh, hence the jaggedness.

The second section (00:30 ~ 01:01) shows what happens when input from the colour camera is
While something of a hack, this effectively erases the human from the real image, leaving only
the 3D model to interact with the environment. Here we can see that the 3D model does blend in
well with the environment, mostly thanks to the edges drawn on the real image. We also see some
testing of the occlusion, where I put the left hand behind the shelf.

In the third section (01:01 ~ 01:28) I show off some of the occlusion handling of the
application. Here we see a Kinect box on the table on the right. To reach it, the hand of the
character has to go behind a white pole. We see that while part of the arm is occluded as
expected, the occluded area is off to the left by some margin. This is actually a hardware
limitation: the Kinect colour camera and the depth camera are not actually aligned. This has to
do with the fact that the depth “camera” is in fact an infrared projector and an infrared sensor,
placed on either side of the colour camera. For objects nearer or farther than a certain
distance, the depth data will be misaligned with the colour image. This goes to show that the
Kinect is, after all, a controlling device and is not designed to give precise visual data. In
any case, we see that occlusion works, where the character’s hand appears to be behind both the
pole and the box.

The fourth section (01:28 ~ 02:04) shows off more of the occlusion handling. This time, a flight
of stairs is moved to somewhere in front of the character. Here the handling works pretty well,
really giving an image where the character is behind the bars and actually interacting with the
stairs to put them back. One thing to note is that the character has to be detected first, before
the stairs are moved into place. Doing otherwise will cause the stairs to be recognised as part
of the character and mess up the skeletal detection.

The last section (02:04 ~) was taken while testing the application in the classroom. Here, we see
an environment with many tables occluding the feet of the character. While I try to move between
the tables, we see that the occlusion handling is not firing; instead, the skeleton recognition
thinks that the feet of the skeleton have moved to somewhere above the tables. This is because
the feet are completely occluded, giving no clue to the skeletal engine. As this is a limitation
of the Kinect itself, it is probably acceptable for our application to fail gracefully in this
case.

Another observation is that some patches on the wall at the far end of the room are detected as
foreground. This is because invalid points, whether too close or too far, all give a depth value
of 0 and are hence classified as foreground. Again, this is a limitation of the Kinect itself.

Future Improvements

1) Better Bone Orientation

While viewing the demo video it is hard not to notice the jaggedness of some of the movements
of the model, especially for the legs. While, as mentioned above, this can be a problem due to
the environment, it also has some roots in the fact that the bone orientations are estimated.
In fact, a common request from users of the Microsoft SDK is to expose methods to get the exact
joint orientation, as in OpenNI. This is the reason why most motion capture applications are
using the OpenNI API at this moment.

On 1 February 2012, Microsoft released version 1.0 of the SDK, which promises that the
“Windows Skeletal Tracking system is now tracking subjects with results equivalent to the
Skeletal Tracking library available in the November 2011 Xbox 360 Development Kit”. With
this development it may be possible to do the bone mapping more fluidly and easily. (With
concerns over problems arising from code migration, my application is sticking with the Beta
release.)

2) Handling Objects Held by the User

While objects clearly in front of the user are handled and moved to the foreground layer, objects
held by the user are painted by the depth camera as part of the user and are therefore moved to
the background. This can actually be handled if we search through the pixels with the same player
id looking for outliers (pixels with depth too different from the average/median). This, however,
is potentially computationally heavy, and is left for future optimization.
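
A minimal version of this outlier search is sketched below. It is an untested idea rather than
part of the application: the tolerance of 150 mm, the choice to flag only pixels that are closer
to the camera than the median, and the function name are all assumptions.

```python
from statistics import median

def held_object_pixels(depth, player_ids, player_id, tolerance_mm=150):
    """Return (row, column) positions of player pixels much closer than the player's median depth."""
    samples = [(y, x, d)
               for y, row in enumerate(depth)
               for x, d in enumerate(row)
               if player_ids[y][x] == player_id and d > 0]  # skip invalid (0) readings
    if not samples:
        return []
    mid = median(d for _, _, d in samples)
    return [(y, x) for y, x, d in samples if mid - d > tolerance_mm]
```

The pixels returned could then be moved from the background to the foreground layer. The search
visits every player pixel once per frame, plus the cost of computing the median.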

3) More 3D Models

Currently, only one model is in the system, and the model is scaled according to the physique of
the human detected. Ideally, the system should be able to extract features from the human, and
pick from a library a model that best represents the human. (Or, alternatively, give the player
an interface to choose a model that s/he likes.) This involves a whole new problem of feature
extraction, and is left for future exploration.

4) Better Processing for the Real Image

While the current filtering for the real image is good enough to pair with our 3D model, we do
still sometimes see glaring patches of noisy textures. A method that makes use of Total
Variation Optical Flow (Wedel et al.) separates structure (untextured objects) from texture, and
may be something that can be helpful to future iterations of this project.

5) Adding Shadows

As noticed from the last section of the demo video, in a more open environment, the lack of
shadow makes it harder for the model to blend in. A simple “shadow plane” on the ground may do
the job in some cases, but if our model is in front of some object we have to consider the case
where the shadow is cast on a non-planar surface. How to create such shadows is ongoing research
in the mixed reality field of computer vision.
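
For the simple planar case, a shadow can be obtained by projecting each vertex of the model onto
the ground along the light direction. The sketch below assumes a directional light, a horizontal
ground plane at a known height, and a coordinate system with y pointing up; none of this is
implemented in the application, and the function name is hypothetical.

```python
def project_to_ground(point, light_dir, ground_y=0.0):
    """Project a 3D point onto the plane y = ground_y along the direction the light travels."""
    px, py, pz = point
    lx, ly, lz = light_dir
    if ly >= 0.0:
        raise ValueError("light must travel downwards (negative y component)")
    t = (ground_y - py) / ly
    return (px + t * lx, ground_y, pz + t * lz)
```

Drawing the projected vertices as a dark, flat shape gives the “shadow plane”. The non-planar
case mentioned above needs the geometry of the real scene and is not covered by this sketch.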




References

A. Gooch, B. Gooch, P. Shirley, E. Cohen, “A non-photorealistic lighting model for
automatic technical illustration”, 1998.

[5] J. Canny, “A Computational Approach to Edge Detection”, 1986.

L. Rudin, S. Osher, E. Fatemi, “Nonlinear Total Variation based noise removal
algorithms”, 1992.

A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers, “An Improved Algorithm for TV
Optical Flow”, 2009.