Recursive MIS Computation for Streaming BDPT on the GPU

skillfulwolverineSoftware and s/w Development

Dec 2, 2013 (3 years and 11 months ago)

118 views

TECHNICAL REPORT,SEPTEMBER 2011 1
Recursive MIS Computation for Streaming
BDPT on the GPU
Dietger van Antwerpen
Delft University of Technology
Abstract—Bidirectional Path Tracing (BDPT) is a robust unbiased rendering algorithm that samples paths by connecting eye and light
paths.By optimally combining different sampling strategies using Multiple Importance Sampling (MIS),BDPT efficiently renders scenes
with complex light effects.However,BDPT does not map well on a streaming architecture such as the GPU;Stochastic path lengths
lead to an uneven workload between BDPT samples and a large memory footprint that depends on the maximumallowed path length.
In this paper,we present a streaming alternative to BDPT,called SBDPT.We present a novel scheme for recursively computing balance
heuristic MIS weights.This scheme allows for bidirectional connections to be evaluated independently,only requiring data fromthe light
and eye vertex directly involved in the connection.Using this property,we modify the BDPT algorithm to generate a new eye path for
each light vertex.The resulting SBDPT algorithm only requires temporary storage for the last vertex on both eye and light paths.Our
implementation achieves high performance and has a fixed low memory footprint,independent of the maximum path length.

1 INTRODUCTION
Since the introduction of General-Purpose GPU comput-
ing,there has been a significant increase in ray tracing
performance on the GPU.While CPU’s perform reason-
ably well on coherent rays with similar origin and direc-
tion,on current hardware the GPU vastly outperforms
the CPU when it comes to incoherent rays,required for
unbiased rendering algorithms [1].A number of path
tracers have been implemented on the GPU,resulting
in interactive unbiased renderers [2],[3].Interactive
unbiased renderers can have significant value,rapidly
providing artists and designers with accurate feedback
before starting the often timely computation of the final
high-quality images [4].
In practice,path tracing suffers from large variance
when rendering indoor scenes with strong indirect light-
ing effects or high-frequency effects such as caustics.
Bidirectional path-tracing (BDPT) [5],[6] has the same
qualities as path tracing,but is much more robust with
respect to these complex light effects.BDPT achieves
much lower variance by optimally combining different
sampling strategies using multiple importance sampling
(MIS).Already,some effort has been made to implement
BDPT on the GPU.These solutions either use the GPU
as a co-processor [4],or restrict the BDPT algorithm by
limiting its maximum path length and using suboptimal
combination weights [2].
In this paper,we present a streaming BDPT implemen-
tation that runs entirely on the GPU.Our algorithmcom-
putes the balance heuristic weights using a novel compu-
tation scheme.This scheme allows for a more streaming
formulation of the BDPT algorithm.Consequently,our
implementation does not limit the maximumpath length
and has a fixed,low memory footprint,independent of
the actual path lengths.
Fig.1:Test scenes:Kitchen,Sponza,Bathroom,and
Dragon.
2 RELATED WORK
Since the introduction of GPGPU programming,there
has been a great deal of research on GPU ray tracing.
Purcell et al.[3] were the first to publish an efficient
GPU path tracer.Since then,many researchers have
attempted to improve GPU ray tracing.Most of this
work has been focused on efficient ray traversal on the
GPU [1],[7].Particularly relevant to our discussion is
the work of Nov´ak et al.[2],who studied the decrease
in SIMT efficiency due to the stochastic termination of
TECHNICAL REPORT,SEPTEMBER 2011 2
paths during path tracing.They proposed to keep the av-
erage SIMT efficiency high by immediately regenerating
terminated paths.On current hardware,GPU ray tracing
outperforms CPU ray tracing by an order of magnitude.
Recently,some attempts have been made at mapping
the BDPT algorithm on the GPU.Nov´ak et al.[2] pre-
sented the first BDPT implementation running entirely
on the GPU.First,a large collection of independent light
paths is generated in parallel and stored in GPU mem-
ory.Then,during eye path tracing,each eye vertex is
connected to all vertices on one of these light paths.For
each light path,enough memory is allocated in advance
to store a path of some predetermined maximumlength.
Consequently,this method’s memory footprint is propor-
tional to the maximum path length,resulting in either a
large memory footprint or a low maximum path length.
They did not address MIS but simply weighted all
paths equally.Furthermore,their implementation uses
a suboptimal importance sampling scheme of sharing
Russian roulette decisions between all threads in a GPU
warp to keep GPU efficiency high.
Pajot et al.[4] presented a hybrid CBPT implementa-
tion,balancing work between CPU and GPU.First,N
E
eye paths and N
L
light paths are generated on the CPU.
Then,all vertices of all eye and light paths are copied to
the GPU.On the GPU,all eye vertices are connected to
all light vertices of all N
L
samples.The results are copied
back to the GPU and accumulated on the image plane.
All connections are evaluated in parallel,leading to high
GPU efficiency.Pajot et al.proposed to balance the CPU
and GPU workload by carefully choosing N
E
and N
L
.
Because all eye and light paths are generated on the
CPU,for large N
E
and N
L
the required GPU memory is
roughly proportional to the average path length,instead
of the maximumpath length.Therefore,there is no need
for a (tight) maximum path length.
Since the introduction of BDPT,the field of unbi-
ased physically based rendering has made significant
advances.The application of Markov Chain Monte Carlo
methods has led to more robust rendering methods
for very difficult scenes [8],[9].Unfortunately,these
methods are sequential in nature and depend heavily
on the initial state.Consequently,they do not provide
valuable feedback as rapidly as standard Monte Carlo
methods for all but the most difficult scenes.In this
paper,we will therefore ignore these methods and focus
on BDPT.
In section 7,we will compare our work with a
straightforward BDPT implementation,along the lines
of a parallel CPU implementation,and a hybrid CBPT
implementation as described by Pajot et al.[4].We
will show that our method has a significantly smaller
memory footprint while achieving faster convergence.
3 BDPT ON GPU
Unbiased Monte Carlo rendering algorithms such as
BDPT are often viewed as being embarrassingly parallel
[10].Because these algorithms require the evaluation
of many independent samples,extracting parallelism
seems as trivial as constructing many independent BDPT
samples in parallel,evaluating one sample per thread.
However,the large memory consumption of such
a trivial implementation is problematic.Each thread
requires temporary memory to store its path vertices.
During connection,these vertices are required to estab-
lish the connections and again,for each connection,to
compute the MIS weight.Enough memory is usually
allocated per thread beforehand to reach some predeter-
mined maximumpath length.Restricting the path length
introduces some bias in the final estimate.To keep bias
limited,a significantly large maximum path length is
chosen.Consequently,running many threads in parallel
results in a large memory footprint.This is problematic
because modern GPUs have limited memory.This video
memory is also needed to store the scene data.If the
scene does not fit entirely in video memory,out of core
rendering techniques are required,considerably decreas-
ing overall performance.It is therefore important to keep
the method’s memory footprint low,without introducing
(too much) bias in the estimate.
Furthermore,as pointed out by Nov´ak et al.[2],the
stochastic path lengths of BDPT samples reduce SIMT
efficiency.On GPUs with CUDAarchitecture,threads are
processed simultaneously in groups of 32,called warps;
all executing in SIMT.To achieve maximumfine-grained
parallelism a GPU requires coherence in both code exe-
cution and memory access between threads in a warp.
The stochastic paths lengths lead to an uneven workload
and incoherent execution between BDPT samples,which
reduce GPU efficiency.
This paper presents a recursive scheme for computing
the power heuristic weights,allowing independent eval-
uations of bidirectional connections,only requiring data
from the light and eye vertex directly involved in the
connection.Using this property we modify the BDPT
algorithm into the more streaming SBDPT algorithm
by generating a new eye path for each vertex on the
light path.The algorithm ensures high GPU utilization
through path regeneration,in spite of the stochastic path
termination.Our CUDAimplementation requires merely
storage for a single light and eye vertex in memory
at any time during sample evaluation.This causes the
implementation’s memory footprint to be independent
of the maximum path length,keeping memory con-
sumption low without introducing bias.
4 BACKGROUND
BDPT is a Monte Carlo based rendering algorithm.To
make an image,BDPT samples random light transport
paths from the light source to the eye.A light transport
path X = x
0
;  ;x
k
is a sequence of points with x
0
on the lens,x
k
on the light source and x
1
;  ;x
k1
on
the scene surface.The total power reaching the eye is
found by estimating the integral of radiance L(X) over
TECHNICAL REPORT,SEPTEMBER 2011 3
the space
of all paths
I =
Z


L(X) d
(X):(1)
The radiance L(x
0
;  ;x
k
) equals
L(x
0
;  ;x
k
) =W(x
1
!x
0
) 
k1
Y
j=0
G(x
j
$x
j+1
) 
k1
Y
j=1
f
r
(x
j+1
!x
j
!x
j1
)  L
e
(x
k
!x
k1
):
In this equation W(x
1
!x
0
) is the sensor sensitivity
function,L
e
(x
k
!x
k1
) is the light source emittance at
x
k
and f
r
(x
j+1
!x
j
!x
j1
) is the BSDF at x
j
.Finally,
G(x
i
$x
i+1
) is the geometric term to convert from unit
projected solid angle to unit surface area.
G(x
i
$x
i+1
) = V (x
i
$x
i+1
)


cos (
i
) cos


0
i+1


jjx
i+1
x
i
jj
2
:(2)
In this factor V (x
i
$x
i+1
) is the visibility term,which
is 1 if the two surface points are visible fromone another
and 0 otherwise.
i
and 
0
i+1
are the angles between
the local surface normals and the path segment between
vertices x
i
and x
i+1
.
BDPT samples collections of correlated light transport
paths [5],[6].It samples an eye path and a light path us-
ing random walks and connects these to form complete
light transport paths.The eye path starts at the eye and
is traced backwards into the scene.The light path starts
at a light source and is traced forward into the scene.
The random walks are randomly terminated at each
bounce using Russian roulette.Explicitly connecting the
endpoints of any eye and light path using a shadow
ray results in a complete light transport path from light
source to eye.For efficiency,instead of generating a
single path per sample,BDPT usually generates a collec-
tion of correlated paths per sample.The most common
approach is to construct paths by connecting all vertices
on one eye path to all vertices on one light path.Figure
2 shows such a BDPT sample.Pajot et al.[4] proposed
a more general approach called Combinatorial Bidirec-
tional Path Tracing (CBPT).A sample in CBPT consists
of N
E
eye paths and N
L
light paths.Light transport
paths are constructed by connecting all vertices on all
N
E
eye paths to all vertices on the N
L
light paths.For
N
E
= N
L
= 1,CBPT is equivalent to BDPT.Increasing
N
E
and N
L
results in larger collections of correlated
paths per sample.
Let X 2
be a light transport path of length k with
vertices x
0
;  ;x
k
,that is part of a BDPT sample.This
path can be constructed using k+1 different bidirectional
sampling strategies by connecting an eye path x
0
;  ;x
s
of length s  0 with a light path x
s+1
;  ;x
k
of length
t  0,with k = s + t.Note that both the eye and
light path may be of length zero.In case of a zero
length light path,the eye path ends at a light source.
This strategy is called implicit.All remaining sampling
Fig.2:A bidirectional sample
strategies are explicit because they require an explicit
connection.In case of an eye path of length 0,the light
path is directly connected to the eye.These paths are
called caustic paths.For simplicity we ignore light paths
that directly hit a finite aperture lens.p
s
(X) is defined
as the probability of generating X by connecting an eye
path of length s with a light path of length t = k s.
During a random walk the next path vertex x
i+1
is
sampled by generating a random outgoing direction
x
i
!x
i+1
based on the incoming direction x
i1
!x
i
.
The probability P (x
i1
!x
i
!x
i+1
) of sampling the
outgoing direction is usually given per unit projected
solid angle.The geometric factor gives the relation be-
tween projected solid angle and unit area.We assume
that the inverse termination probability due to Russian
roulette is incorporated in this probability.The probabil-
ity of sampling x
i+1
from x
i
per unit area as part of the
eye path is thus given by
!
p
i
= P (x
i1
!x
i
!x
i+1
) G(x
i
$x
i+1
):(3)
Similarly,the probability of sampling x
i1
from x
i
per
unit area as part of the light path is given by

p
i
= P (x
i1
x
i
x
i+1
) G(x
i1
$x
i
):(4)
As proposed by Veach and Guibas [5],we define two
special cases to keep the above notation clean:
 P (x
1
!x
0
!x
1
) is the probability of sampling
the outgoing direction x
0
!x
1
and depends on the
lens model.
 P (x
k1
x
k
x
k+1
) is the probability of sam-
pling the outgoing direction x
k1
x
k
from the
light source and depends on the emission model.
Note that vertex x
1
on an eye path is usually con-
structed by sampling a random point in screen space.
Consequently,in order to compute
!
p
0
the sampling
probability w.r.t.unit screen space must be converted
to the corresponding probability w.r.t.unit surface area
[11].
What remains are the boundary cases of sampling the
first eye and light vertex.The eye vertex is sampled
directly on the lens and the light vertex is sampled
directly on the light source.The corresponding sampling
probabilities are usually already expressed per unit area,
so no extra conversions are required:
!
p
1
=P (x
0
)

p
k+1
=P (x
k
):
(5)
TECHNICAL REPORT,SEPTEMBER 2011 4
Using the above,the bidirectional sampling probability
p
s
(X = x
0
;  ;x
k
) of sampling X by connecting an eye
path of length s with a light path of length t = k  s
equals
p
s
(x
0
;  ;x
k
) =
s1
Y
i=1
!
p
i
k+1
Y
i=s+2

p
i
:(6)
BDPT samples a collection of paths X
1
;  ;X
N
,where
each path is sampled using some bidirectional sampling
strategy.These paths are then combined into a single
Monte Carlo estimate:
I 
N
X
i=1
w(X
i
)
L(X
i
)
p(X
i
)
:(7)
Because any path may be sampled through one of mul-
tiple bidirectional sampling strategies,their contribution
must be properly weighted.w
s
(X) is the weight for a
path X of length k that is constructed by connecting an
eye path of length s and a light path of length t = k s.
There are various possible combination strategies that
result in an unbiased estimator.Veach and Guibas [5]
proposed the power heuristic
w
s
(X) =
p
s
(X)

P
k
i=0
N
i;k
p
i
(X)

;(8)
where  can be chosen freely with  = 1 being called the
balance heuristic and !1 the maximum heuristic.
N
i;k
is the upper bound for the number of paths per
sampling strategy that can occur in a BDPT sample.Note
that due to Russian roulette,most of these paths are
not actually generated.In ordinary BDPT,each strategy
occurs at most once,so N
i;k
= 1.However,because
CBPT combines multiple eye and light paths,a CBPT
sample may contain N
E
implicit paths,N
L
caustic paths,
and N
E
N
L
paths for each of the other explicit con-
nection strategies.Hence,N
0;k
= N
L
;N
k;k
= N
E
and
N
0<i<k;k
= N
E
N
L
.Veach and Guibas [5] showed that the
power heuristic is the best possible combination in the
absence of further information.In particular,he proved
that no other combination strategy can be a significant
improvement over the balance heuristic.
During BDPT sample construction,a MIS weight
needs to be computed for each connected eye and light
vertex.Veach and Guibas [5] proposed a method for
efficiently constructing balance heuristic weights.This
method is based on similarities in the bidirectional sam-
pling probabilities p
i
(X) and p
i+1
(X).The correspond-
ing strategies sample the first x
0
;  ;x
i
vertices as part
of the eye path and at least the last vertices x
i+2
;  ;x
k
as part of the light paths.Hence,they only differ in the
way vertex x
i+1
is sampled.Thus,the ratio of p
i+1
(X)
and p
i
(X) equals
p
i+1
(X)
p
i
(X)
=
!
p
i

p
i+2
:(9)
This equation can be applied repeatedly,starting with
p
s
(X) to find the other bidirectional sampling proba-
bilities p
s+1
(X);  ;p
k
(X).Similarly,the reciprocal ra-
tio p
i
(X)=p
i+1
(X) is used to compute the probabilities
p
s1
(X);  ;p
0
(X).Once all p
i
have been computed,the
computation of the balance heuristic,power heuristic or
maximum heuristic is fairly straightforward.
5 RECURSIVE MIS
Although Veach and Guibas’s [5] computation scheme
for balance heuristic weights is very generic and fairly
efficient in terms of computations,the number of op-
erations to construct each weight depends on the path
length,requiring data from all vertices on the path.
This prevents efficient parallelization of both compu-
tations and memory access on the GPU.We propose
a different scheme for computing the power heuristic
weights.During the construction of the eye and light
paths,we recursively compute one extra quantity in each
path vertex.Using this quantity,the weights for each
connection can be constructed locally,requiring only
data from the eye and light vertex directly involved in
the connection.
To do this,instead of directly computing the weight
for a connection,it is both easier and more numerically
stable to compute the inverse weight
1
w
s
(X)
=
k
X
i=0
N
i;k
p
i
(X)
p
s
(X)
:(10)
Based on common factors in the terms,this equation can
be split into three parts.
1
w
s
(X)
=
s1
X
i=0
N
i;k
p
i
(X)
p
s
(X)
+N
s;k
+
k
X
i=s+1
N
i;k
p
i
(X)
p
s
(X)
:(11)
The first term in equation 11 represents all sampling
strategies with less than s eye vertices.These strategies
all sample at least the vertices x
k
;  ;x
s
as part of
the light path.Hence,the probability of sampling these
vertices is a common factor in both numerator and
denominator and cancel out.Furthermore,all forward
sampling probabilities appearing in the numerator also
appear in the denominator and thus cancel out as well,
resulting in
s1
X
i=0
N
i;k
p
i
(X)
p
s
(X)
=
s1
X
i=0
N
i;k
Q
s+1
j=i+2

p
j
Q
s1
j=i
!
p
j
=

p
s+1
s1
X
i=0
0
@
N
i;k
!
p
i
s1
Y
j=i+1

p
j+1
!
p
j
1
A
:
By expanding the last termin this summation,the recur-
TECHNICAL REPORT,SEPTEMBER 2011 5
sion appears:
s1
X
i=0
N
i;k
p
i
(X)
p
s
(X)
=

p
s+1

0
@
N
s1;k
!
p
s1
+

p
s
!
p
s1
s2
X
i=0
0
@
N
i;k
!
p
i
s2
Y
j=i+1

p
j+1
!
p
j
1
A
1
A
:
We define the quantity d
E
s
as
d
E
1
=
N
0;k
!
p
0
d
E
s
=
s1
X
i=0
N
i;k
!
p
i
s1
Y
j=i+1

p
j+1
!
p
j
=
N
s1;k
+

p
s
d
E
s1
!
p
s1
:
(12)
During the construction of the eye path,this quantity
is recursively computed for each eye vertex.Using this
quantity,the first term in equation 11 reduces to
s1
X
i=0
N
i;k
p
i
(X)
p
s
(X)
= d
E
s

p
s+1
:(13)
Hence,by using the precomputed quantity d
E
s
when
making a connection,the first term can be computed
locally,requiring only data from the eye vertex involved
in the connection and a fixed amount of operations,
independent of the path lengths.
The second term in equation 11 corresponds to the
strategy with exactly s eye vertices and is the strategy
actually used to sample this path.Because p
s
(X) ap-
pears in both the numerator and denominator,this term
reduces to N
s;k
.
Finally,the last term in equation 11 is similar to the
first term and represents all sampling strategies with
more than s eye vertices.After eliminating all common
factors,this sum becomes
k
X
i=s+1
N
i;k
p
i
(X)
p
s
(X)
=
!
p
s
k
X
i=s+1
0
@
N
i;k

p
i+1
i1
Y
j=s+1
!
p
j

p
j+1
1
A
:
The corresponding quantity d
L
s+1
is defined as
d
L
k
=
N
k;k

p
k+1
d
L
s+1
=
k
X
i=s+1
N
i;k

p
i+1
i1
Y
j=s+1
!
p
j

p
j+1
=
N
s+1;k
+
!
p
s+1
d
L
s+2

p
s+2
:
(14)
During the construction of the light path this quantity is
recursively computed for each light vertex.Hence,the
last term in equation 11 reduces to
k
X
i=s+1
N
i;k
p
i
(X)
p
s
(X)
= d
L
s+1
!
p
s
:(15)
By substituting equations 13 and 15 into equation 11,the
equation reduces to
1
w
s
(X)
=
8
<
:
N
0;k
+d
L
1
!
p
0
if s = 0
d
E
k

p
k+1
+N
k;k
if s = k
d
E
s

p
s+1
+N
s;k
+d
L
s+1
!
p
s
other:
(16)
Note that the value

p
s
in the quantity d
E
s
depends on
both the eye and light vertex involved in the connection.
Because the light vertex is unknown during the construc-
tion of the eye path,d
E
s
can only be computed fromd
E
s1
when the actual connection is made.The same holds for
d
L
s+1
.
5.1 Local sampling
During eye path construction we must be able to eval-
uate the reverse sampling probability

p
i
.However,at
the time of evaluation no information about the light
path is available.This evaluation must therefore be
independent of the light path.Because of this restriction,
any sampling decisions at a vertex may only depend on
local information such as the incoming direction and the
surface properties,and not on global path information
such as the vertex’s index on the light path.Note that
the Russian roulette termination probability is part of the
sampling probability.Therefore,this termination proba-
bility may also only depend on local vertex information.
Obviously,the local sampling restriction also applies to
light path construction.
A similar restriction holds for N
i;k
.Because k is not
yet known during eye path construction,N
i1;k
must
be directly computable from i,without knowledge of
k.Similarly,in order to evaluate d
L
i
during light path
construction,N
i;k
must be directly computable from
t = k  i as well.Both BDPT and CBPT satisfy these
restrictions.
The first diffuse vertex x
i
on the eye path forms
an exception to the local sampling restriction.Because
connecting to/from a specular vertex results in zero
probability,all sampling strategies sampling any of the
preceding vertices x
1
;  ;x
i1
as part of the light path
result in zero probability.Therefore,the first diffuse
vertex on a light transport path is either generated as
part of the eye path or by connecting a light path vertex
directly to the eye,corresponding to i = 1.Consequently,
the sample probability P(x
i1
!x
i
!x
i+1
) is only
evaluated during eye path construction and connection,
but never during light path construction.For this reason,
this first diffuse vertex poses an exception to the local
sampling restriction.During both eye path construction
and connection evaluation,the index i is known,so the
termination decision may depend on i.This is useful,
because we usually like to force at least one diffuse
bounce on a path.Note that as the first light vertex is
a diffuse vertex,the light path can be forced to have at
least two diffuse vertices through similar reasoning.
TECHNICAL REPORT,SEPTEMBER 2011 6
Fig.3:Construction of an SBDPT sample.1) Light vertex is constructed 2) Eye path is generated and connected to
light 3) Light path is extended 4) Light path is connected to eye 5) Eye path is generated and connected to light
path 6) Light path is terminated.
5.2 Details
We claimed that only data from a single vertex on
both eye and light path is required to compute the
balance heuristic weights.However,due to the geometric
factor in

p
i
and
!
p
i
,two path vertices seem required
to compute

p
i
and
!
p
i
.We overcome this problem by
carefully splitting up these computations.For example,
the computation of

p
i
can be split into

p
0
i
=jcos (
i1
)j

p
i
=

p
0
i
P (x
i1
x
i
x
i+1
)
jcos (
0
i
)j
jjx
i
x
i1
jj
2
:
(17)
During eye path construction,

p
0
i
is computed in vertex
x
i1
and

p
i
is computed from

p
0
i
in vertex x
i
.This
way,only data from the last path vertex is required at
any point during the construction of the weights.The
computation of
!
p
i
during light path construction can
be split up in a similar way.
Remember that the balance heuristic is not the only
valid MIS weight scheme.A recursive computation
scheme for the more generic power heuristic can easily
be obtained by raising all probabilities

p and
!
p to the
power of .Similarly,the maximum heuristic can be
obtained from the balance heuristic by substituting all
addition operators by maximum operators.
5.3 Specular bounce
Thus far we assumed that we could numerically evaluate
!
p
i
and

p
i
for all i.However,this is only true for
diffuse materials.Specular materials are modeled by
Dirac delta functions [12].Therefore,sample probabil-
ities P (x
i1
!x
i
!x
i+1
) and P (x
i1
x
i
x
i+1
) at
a specular vertex x
i
also contain Dirac delta functions.
Consequently,
!
p
i
and
!
p
i
cannot be evaluated numer-
ically.Special care must be taken to handle specular
vertices correctly while computing MIS weight.
Note that in BDPT,specular vertices are never used
in an explicit connection to form a complete light path.
For such paths the Monte Carlo contribution would
equal f(X)=p
s
(X) = 0.Therefore,specular vertices are
skipped when evaluating connections.Hence,for some
light transport path X sampled with p
s
(X),vertices s
and s+1 are always diffuse.In turn,when either vertex
i or i +1 is specular,the probability p
i
(X) = 0.For this
reason,d
E
i
is slightly modified when specular vertices
are involved.Whenever either vertex i or i+1 is specular,
this path could not have been sampled by connecting
vertex i and i +1 and thus d
E
i
reduces to
d
E
i
=

p
i
d
E
i1
!
p
i1
:
(18)
The same holds for d
L
i
on the light path.
6 STREAMING BDPT
In the original BDPT algorithm,a single eye path and
light path are generated per sample and connected to
form a collection of light transport paths.In section 3
we explained why this method is not well suited for
an efficient GPU implementation.We propose therefore
to alter the BDPT algorithm to make it better fit the
streaming architecture of a GPU.We call this algorithm
Streaming BDPT or SBDPT.Instead of generating a
single eye path per sample,we generate a new eye path
for each vertex on the light path.Figure 3 shows how an
SBDPT sample is constructed.We start out with a light
path containing a single vertex on some light source (1).
After this vertex is generated,the first eye path is gener-
ated (2).During construction of the first eye path,each
vertex is directly connected to the light vertex resulting
in complete light transport paths.When the eye path
terminates,the light path is extended by a single vertex
(3).This vertex is directly connected to the eye first (4),
before a new eye path is generated (5).The vertices on
this new eye path now connect to the new endpoint of
the extended light path.This process is repeated until
the light path is terminated (6) and the SBDPT sample is
complete.Using this method,connections are no longer
made after either of the paths is complete,but instead
during the construction of the light and eye paths.At
any time during this construction only the endpoints of
the paths are connected.The computation scheme from
last section makes it possible to compute the balance
heuristic weights for each connection using only data
from the connected endpoints.Note that sampling the
next path vertex on either the eye or light vertex also
requires just the data from the current endpoint of the
path.Hence,we only need to store the endpoints;there
is no need to store the whole path.This results in a
fixed memory footprint per sample,independent of path
length.
6.1 Implicit paths
The probability of generating a light transport path as
part of a BDPT sample remains the same for SBDPT,
TECHNICAL REPORT,SEPTEMBER 2011 7
0.25
0.255
0.26
0.265
0.27
0.275
0.28
0.285
0.29
0.295
0.3
5
10
15
20
25
30
RMSE
n_l
SBDPT RMSE
BDPT RMSE
Fig.4:Sensitivity of RMS error to ~n
l
approximation in
dragon scene after 1000 samples per pixel,with
n
l
=
2:894.
except for implicit paths.Instead of generating only a
single eye path per sample,SBDPT generates on average
n
l
eye paths per sample,with
n
l
being the expected num-
ber of non-specular vertices on a light path.Hence,the
expected number of implicit paths per sample increases
by a factor of
n
l
.We should therefore use N
k;k
=
n
l
in
the balance heuristic weights.
Unfortunately,
n
l
is usually not known in advance.
Instead,we used an estimate ~n
l

n
l
.Note that with-
out taking special measures,the error in this estimate
would result in bias in the final image.In particular,
the expected contribution of implicit paths would be off
by a factor of ~n
l
=
n
l
.To counter this bias,we record the
contributions of all implicit paths in a separate image,
called the implicit image.Afterwards,we reduce the
contribution of the implicit image by ~n
l
=
n
l
and add it
to the final image,resulting in an unbiased estimate.
Although this method keeps the final estimate unbiased,
using N
k;k
= ~n
l
instead of
n
l
in the balance heuristic
results in theoretically slightly suboptimal MIS weights,
possibly increasing the variance somewhat.
From our test scenes the Dragon scene turned out to
be the most sensitive to the choice of ~n
l
.This can be at-
tributed to the presence of many highly glossy surfaces.
Consequently,this scene contains many effects for which
explicit and implicit sampling strategies have similar
sampling densities.Figure 4 quantifies the numerical
error for the Dragon scene as a function of ~n
l
after 1000
samples per pixel.The graph shows that the algorithmis
not very sensitive to the choice of ~n
l
.Therefore,a rough
estimate of ~n
l
suffices.The mean and variance of n
l
in
the test scenes is shown in table 2,indicating that a small
set of about 10
4
randomlight paths should be more than
enough to adequately estimate
n
l
in advance.
An alternative solution would be to either only ac-
count for implicit paths on the first eye path,or by
generating multiple light paths.Instead of generating a
new eye path for each vertex on the light path,we could
generate a newlight path for each vertex on the eye path.
This way,only a single eye path is created per sample
and thus N
k;k
= 1.Multiple light paths will be generated
per sample but as we do not consider the contribution
of implicit light paths (light paths hitting the lens) we
do not need to estimate
n
e
.For both alternatives the
probability of generating any path as part of an SBDPT
sample would be identical to BDPT.
Note however that by generating multiple eye paths
we only increase the probability of sampling implicit
paths,while keeping all other probabilities equal.In a
sense,we get the extra implicit paths for free.Hence,as
long as we do not overestimate the increase in probabil-
ity ( ~n
l
<
n
l
),we can only improve over BDPT.As we
force at least two diffuse vertices on each light path,we
can safely assume that
n
l
> 2 for indoor scenes.In the
remainder of this paper we use the fixed conservative
estimate of ~n
l
= 2,mainly because it eliminates the (be
it small) preprocessing phase of estimating
n
l
and thus
speeds up interactive feedback,which is an important
feature for practical renderers [4].
6.2 Implementation
In this section,we discuss the details of our SBDPT im-
plementation in CUDA.To construct an SBDPT sample,
the implementation repeatedly extends the sample with
one vertex on either eye or light path (See Figure 3).After
each extension,a connection between the endpoints of
the eye and light path is evaluated.Whenever the eye
path is terminated,is is immediately regenerated in
place.This process of repeatedly extending and connect-
ing is continued until the light path is terminated,after
which the sample is complete.
Fig.5:All samples in SBDPT stream repeatedly extend
either their eye or light paths and connect the endpoints
of both paths.Just before the next extension,any termi-
nated paths are regenerated.
The implementation constructs many independent SB-
DPT samples in parallel.During construction,SBDPT
operates on a long,fixed size stream of samples (see
Figure 5).A separate CUDA thread is executed for
each sample in the stream.The implementation applies
the extension and connection operations repeatedly to
TECHNICAL REPORT,SEPTEMBER 2011 8
all samples in the stream.When all samples are ex-
tended and connected,some of the samples may have
terminated.We use sample regeneration as proposed
by Nov´ak et al.[2] to keep the sample stream filled
and SIMT efficiency high.Before again extending all
samples,any terminated samples are regenerated in
place.We used separate kernels for the extend,connect,
generate,extend ray traversal and connect ray traversal
operations.
Because the lengths of eye and light paths vary be-
tween threads in a warp,some threads are handling their
light path while at the same time all other threads are
handling their eye paths.To keep SIMT efficiency high,
code paths for handling the extension of light and eye
path are shared.These code paths are very similar,apart
from minor variations in sampling the first path vertex
and the evaluation of adjoint BSDF’s.
To reduce visible correlation between eye paths in
an SBDPT sample,pixels are sampled using a stratified
pattern over the image plane.As a consequence,different
samples may contribute to the same pixel.Furthermore,
caustic connections may contribute to any pixel on the
image plane.Therefore,we used atomic instructions to
deposit contributions on the image plane.
We used the Brigade path tracing engine as a starting
point for our implementations [13].The Brigade engine
uses a BVH as acceleration structure and a ray traversal
kernel which is similar to the one described by Aila and
Laine [1].For random number generation,we used the
parallel Mersenne Twister provided with the NVIDIA
CUDA SDK [14].
7 RESULTS
We compared our SBDPT implementation with a GPU
path tracer,a straightforward GPU implementation of
BDPT and a hybrid CBPT implementation as described
by Pajot et al.[4].We did not make a direct comparison
with Nov´ak et al.[2] because their method lacks a
detailed description and seems to closely resemble a
straightforward BDPT.All experiments were performed
on an Intel
R
Core
TM
Duo Quad Q6600 CPU with an sin-
gle NVIDIA
R
Geforce
R
GTX 480 GPU.The implemen-
tations use CUDA for GPU programming.Our BDPT
implementation works along the lines of a traditional
parallel CPU implementation [15].First,the algorithm
constructs an eye and light path for all samples in
parallel and stores them in global GPU memory.Then,a
separate GPU thread is executed for each BDPT sample,
sequentially evaluating all connections between its eye
and light path.We used a maximum path length of 12,
which is usually enough for scenes with few transpar-
ent objects.Note that scenes with participating media
often require longer paths,further stressing memory
resources.The implementation does not use recursive
MIS,but computes the MIS weights using the traditional
approach proposed by Veach and Guibas [5].
To fully occupy all GPU resources and allow for
effective load balancing,the implementations have to
0.01
0.1
1
0
10
20
30
40
50
60
Perceptual error
Time (s)
Sponza
sbdpt
bdpt
cbpt
pt
0.01
0.1
1
0
500
1000
1500
2000
2500
3000
Perceptual error
Time (s)
Dragon
sbdpt
bdpt
cbpt
pt
0.01
0.1
1
0
200
400
600
800
1000
1200
1400
1600
1800
2000
Perceptual error
Time (s)
Bathroom
sbdpt
bdpt
cbpt
pt
0.01
0.1
1
0
100
200
300
400
500
Perceptual error
Time (s)
Kitchen
sbdpt
bdpt
cbpt
pt
Fig.7:Numerical convergence comparison of BDPT,
SBDPT,CBPT and PT over time using perceptual error
measure.
TECHNICAL REPORT,SEPTEMBER 2011 9
(a) PT
(b) BDPT
(c) CBPT
(d) SBDPT without MIS
(e) SBDPT with MIS
Fig.6:Graphical comparison of the Sponza (left),Dragon (middle left),Bathroom(middle right) and Kitchen (right)
scene,with (a) PT,(b) BDPT,(c) CPT,(d) SBDPT without MIS and (e) SBDPT with MIS.Render times were 10s
for Sponza and 30s for Dragon,Bathroom and Kitchen.
TECHNICAL REPORT,SEPTEMBER 2011 10
execute many threads in parallel.For BDPT and SBDPT,
we found that a stream of 131:072 samples is enough.
The performance gain for larger streams is small (less
then 10%) and does not justify the increase in memory
consumption.
7.1 CBPT
The CBPT implementation uses double buffering and
multi-threading to utilize all available computing re-
sources in parallel.As suggested by Pajot et al.[4],,we
generate N
E
= N
T
eye and light paths per CBPT sample,
but connect only N
L
 N
T
of the light paths with
eye paths on the GPU.The MIS weights are adjusted
accordingly.
We found that selecting appropriate values for N
E
and N
L
is not trivial,because they not only determine
the balance between CPU and GPU work,but also the
memory footprint and the quality of the CBPT Monte
Carlo estimate.As proposed by Pajot et al.[4],we tuned
N
E
and N
L
to fully utilize both CPU and GPU.This
turned out to be quite difficult for our test platform
because the hybrid CBPT implementation does not scale
very well with GPU resources.In hybrid CBPT,the
GPU only computes local probabilities for each connec-
tion.These probabilities need to be copied back to the
CPU and combined to compute the MIS weights for
each connection.Thus,the memory bandwidth,memory
footprint and both the CPU and GPU workload are all
proportional to N
E
 N
L
.Consequently,hybrid CBPT
does not scale well with GPU resources.Because we
used a powerful GPU in combination with a somewhat
outdated CPU,GPUmemory and CPUresources quickly
became the bottleneck and prohibited full utilization of
the GPU.The same problems are likely to show when
using a high-end CPU in combination with multiple
GPUs.
Using our recursive MIS scheme solved this shortcom-
ing of CBPT.Because recursive MIS only requires local
data,the GPU can nowalso construct the MIS weight for
each connection.This reduced the CPU workload,mem-
ory transfer bandwidth and memory usage,allowing full
utilization of the GPUon our test platform.Although the
optimal parameters for CBPT depend somewhat on the
particular scene,we found that values of N
E
= N
T
=
2000 and N
L
= 60 were in general a good choice for our
test platform.
7.2 Performance
Kelemen et al.[9] noted that the RMS error is not really
good at expressing image differences and proposed a
simple perceptual error measure to quantify perceptual
convergence.The perceptual error measure is defined
as the ratio of pixels whose relative error exceeds a
given threshold.The images in Figure 1 were progres-
sively refined to 99% convergence,which is enough for
most practical uses.The remaining noise can easily be
removed using simple post processing filters.In Figure
7 we compare convergence over time for the different
algorithms using the perceptual error measure.Note that
for technical reasons,black pixels were not taken into
acount,resluting in the brief initial increase in error
for the PT algorithm.To also give an indication of
the algorithm’s interactivity,Figure 6 visually compares
image quality after a relatively short period of time.
As expected,the bidirectional algorithms significantly
outperform basic PT for the Dragon and Kitchen scenes
in both the short and long term.Both scenes show a
significant amount of indirect light.Furthermore,be-
cause lamps are realistically modeled with reflectors
and lenses,both scenes show large caustics.Therefore,
these scenes benefit much from the bidirectional sam-
pling strategies.The graphs show that SBDPT performs
slightly better than BDPT for these scenes,which can be
attributed to SBDPT’s streaming nature.This is further
confirmed by the raw performance data in table 2.
PT,BDPT and SBDPT perform fairly similar on the
Sponza scene.Initially PT converges faster but the other
algorithms eventually catch up.This can be explained
by looking at Figure 6:Although PT shows less noise
overall,it suffers from a few relatively bright fireflies
which will take longer to go away.
The graphs show that SBDPT significantly outper-
forms BDPT for the bathroom scene.This was to be
expected and is due to the shower area,which is illu-
minated through a window while being viewed through
the shower doors.Consequently,most direct light in the
scene can only be captured by implicit paths.As SBDPT
generates multiple eye paths per sample,it achieves
higher efficiency than BDPT.More surprisingly,PT is
also outperformed by both SBDPT and BDPT in the
bathroom scene,even though PT generates more eye
paths per second.Because most direct light in the bath-
room scene can only be captured by implicit paths,PT
initially shows less visible noise (see Figure 6).However,
in the long run the bidirectional sampling strategies are
necessary to effectively sample all indirect light,making
the bidirectional algorithms more efficient overall.
CBPT seems to perform consistently worse than both
BDPT and SBDPT in both the short and long term.This
is because CBPT generates significantly less eye and light
paths per second,as shown in table 2.Even though CBPT
performs similar to SBDPT in raw rays per second,it
spends most of its time evaluating connects.This shows
that choosing N
E
and N
L
so as to fully utilize all avail-
able computing resources does not necessarily improve
Monte Carlo efficiency,because the quality of the Monte
Carlo estimate depends on N
E
and N
L
.In CBPT,eye
and light paths are generated solely on the CPU,so the
sampling of implicit and caustic paths does not benefit
much from the available GPU resources.Consequently,
the caustics in the Dragon and Kitchen scene and the
reflected caustics in the Bathroom scene remain very
noisy.Interestingly,CBPT also performs worse for the
Sponza scene.This shows that trading too much eye and
light path correlation for explicit connection sampling
TECHNICAL REPORT,SEPTEMBER 2011 11
performance can significantly reduce overall efficiency,
even for relatively simple scenes.
Without our recursive MIS scheme,it would not be
possible to compute MIS weights without storing all
path vertices.The last two rows in Figure 6 show the
importance of MIS.For SBDPT without MIS,all valid
strategies are simply weighted equally,corresponding to
the balance heuristic with !0.Although SBDPT with-
out MIS does improve somewhat over PT for the Dragon
and Kitchen scenes,it suffers from high frequency noise
in the formof fireflies and is not competitive with BDPT.
The use of MIS removes most of this noise,significantly
reducing the variance in the estimate and making SBDPT
a valuable alternative to BDPT.
7.3 Memory
In our implementations,storing a triangle with shading
normals and texture coordinates requires 112 bytes.Fur-
thermore,each BVH node requires 64 bytes.Hence,a
medium sized scene of up to several million triangles
will require several hundred MBs of video memory for
geometry alone.Realistic scenes usually also require a
significant amount of texture memory.In the absence of
out-of-core rendering,this leaves little memory for the
rendering algorithms.
Memory footprint
Per Sample (Byte)
Total (MByte)
PT
152
19
BDPT
2716
339.5
CBPT
-
189
SBDPT
348
43.5
TABLE 1:Memory footprint.
Table 1 shows the memory consumption of our im-
plementations,not including the scene data.The ta-
ble shows the memory consumption per sample and
the total memory consumption,including all temporary
buffers.CBPT has no per sample memory consumption,
as it generates only a single large sample at a time.Note
that the memory consumption of CBPT depends on the
choice of N
E
and N
L
.We could select N
E
and N
L
so as
to reduce memory consumption,but this would prevent
the algorithm from fully utilizing the GPU.
The table shows that SBDPT has a significantly smaller
memory footprint than BDPT.The memory consumption
of BDPT can be reduced by choosing a lower maximum
path length.However,we risk introducing visible bias
in scenes with shiny and transparent materials.On the
other hand,SBDPT does not require a maximum path
length and has a fixed memory footprint.Like PT,
the memory consumption of SBDPT is relatively low.
On modern GPUs with several hundred MB of video
memory,this leaves enough video memory to store a
moderately large scene.
8 CONCLUSIONS
In this paper we presented a novel scheme for recur-
sively computing MIS weights for BDPT.This scheme
allows us to evaluate connections independently,requir-
ing only data from the two connected vertices.Using
this property,we developed a streaming alternative to
BDPT,called SBDPT.SBDPT has a much lower memory
footprint,independent of the maximum allowed path
length,and fits the streaming architecture of modern
GPUs better.Similar to BDPT,SBDPT achieves superior
performance compared to regular PT on difficult scenes
with complex lighting effects such as caustics and in-
direct light.At the same time,SBDPT performs better
on reflected caustics than BDPT because SBDPT spends
more time sampling eye paths.
SBDPT connects each eye vertex to only one light
vertex.We are investigating the possibility of connecting
each eye vertex to the last light vertices on multiple in-
dependent light paths in order to improve the efficiency
of sampling explicit paths by trading some correlation
for performance.
Like PT and BDPT,SBDPT still has a hard time
rendering reflected caustics for very small light sources.
Progressive photon mapping performs much better for
these effects [16].We are further investigating the use
of our MIS computation scheme towards bidirectional
progressive photon mapping [17] on the GPU for opti-
mally combining the strengths of both BDPT and photon
mapping.
Finally,we believe that SBDPT can also be particularly
useful for out-of-core bidirectional rendering on both
CPU and GPU.In out-of-core rendering the scene data
does not fit in memory.Usually out-of-core renderers
generate samples in large batches [18],[19].This way
the cost of loading a block of scene data is amortized
over many samples.By increasing the batch size,the
overhead of out-of-core rendering can be reduced.At
the same time,the sample state for all samples in a
batch must fit in memory as well.To keep the sample
state small,unbiased out-of-core renderers usually only
support path tracing.SBDPTs small fixed size sample
state allows for large batches and thus efficient out-of-
core bidirectional sampling.We leave out-of-core SBDPT
for future study.
ACKNOWLEDGMENTS
Thanks to Erik Jansen,Jacco Bikker and Jan Nov´ak for
their valuable suggestions and comments.Special thanks
to Reinier van Antwerpen for modeling the Dragon
scene.
REFERENCES
[1] T.Aila and S.Laine,“Understanding the Efficiency of
Ray Rraversal on GPUs,” in Proc.HPG ’09.New York,
NY,USA:ACM,2009,pp.145–149.[Online].Available:
http://doi.acm.org/10.1145/1572769.1572792
[2] J.Nov´ak,V.Havran,and C.Dachsbacher,“Path Regeneration
for Interactive Path Tracing,” in Proc.EUROGRAPHICS ’10,Short
Papers.Norrkoeping,Sweden:EG,2010.
[3] T.J.Purcell,I.Buck,W.R.Mark,and P.Hanrahan,“Ray Tracing
on Programmable Graphics Hardware,” in Proc.SIGGRAPH
’02.New York,NY,USA:ACM,2002,pp.703–712.[Online].
Available:http://doi.acm.org/10.1145/566570.566640
TECHNICAL REPORT,SEPTEMBER 2011 12
Performance in 10
6
/sec
BDPT
CBPT
SBDPT
n
l
distribution
Eye
Light
Ray
Eye
Light
Ray
Eye
Light
Ray
n
l

2
(n
l
)
SPONZA
2.24
2.24
22.6
0.17
0.17
23.5
4.82
1.95
27.5
2.646
1.283
KITCHEN
1.80
1.80
20.7
0.13
0.13
27.0
4.68
1.75
26.9
2.824
1.828
BATHROOM
1.49
1.49
20.7
0.09
0.09
21.2
3.35
1.40
25.6
2.495
1.033
DRAGON
1.63
1.63
21.5
0.12
0.12
24.7
4.77
1.55
28.2
2.894
1.663
TABLE 2:Measured statistics for test scenes.The first 9 columns show the performance of BDPT,CBPT and SBDPT
in 10
6
generated eye paths,light paths and traced rays per second.The last two columns show the mean and
variance of n
l
.
[4] A.Pajot,M.Paulin,L.Barthe,and P.Poulin,“Combinatorial
Bidirectional Path-Tracing for Efficient Hybrid CPU/GPU Ren-
dering,” Computer Graphics Forum,vol.30,pp.315–324,2011.
[5] E.Veach and L.J.Guibas,“Optimally Combining Sampling
Techniques for Monte Carlo Rendering,” in Proc.SIGGRAPH
’95.Addison Wesley,1995,pp.419–428.[Online].Available:
http://doi.acm.org/10.1145/218380.218498
[6] E.P.Lafortune and Y.D.Willems,“Bi-Directional Path Tracing,”
in Proc.COMPUGRAPHICS ’93,Alvor,Portugal,1993,pp.145–
153.
[7] S.Popov,J.G¨unther,H.-P.Seidel,and P.Slusallek,“Stackless
KD-Tree Traversal for High Performance GPU Ray Tracing,” In
Computer Graphics Forum,vol.26,pp.415–424,2007.
[8] E.Veach and L.J.Guibas,“Metropolis Light Transport,” in Proc.
SIGGRAPH ’97.New York,NY,USA:Addison Wesley,1997,pp.
65–76.
[9] C.Kelemen,L.Szirmay-Kalos,G.Antal,and F.Csonka,“ASimple
and Robust Mutation Strategy for the Metropolis Light Transport
Algorithm,” In Computer Graphics Forum,vol.21,pp.531–540,
2002.
[10] R.Marques and L.P.Santos,“Instant Global Illumination on the
GPU using OptiX,” 2010.
[11] D.Cline,“A Practical Introduction to Metropolis Light Trans-
port,” Department of Computer Science,Brigham Young Univer-
sity,Tech.Rep.,2005.
[12] E.Veach,“Robust Monte Carlo Methods for Light Transport
Simulation,” 1998,adviser-Guibas,Leonidas J.
[13] J.Bikker,“Brigade real-time path tracing,” 2011.[Online].
Available:http://igad.nhtv.nl/bikker/
[14] V.Podlozhnyuk,“Parallel Mersenne Twister.” NVIDIA Corpora-
tion,Tech.Rep.,2007.
[15] “Luxrender,” 2011.[Online].Available:
http://www.luxrender.net/
[16] T.Hachisuka,S.Ogaki,and H.W.Jensen,“Progressive Photon
Mapping,” in Proc.ACM SIGGRAPH Asia’08.New York,
NY,USA:ACM,2008,pp.130:1–130:8.[Online].Available:
http://doi.acm.org/10.1145/1457515.1409083
[17] J.Vorba,“Bidirectional Photon Mapping,” in Proc.CESCG ’11.
Vini
ˇ
cn
´
e,Slovakia:CESCG,2011.
[18] B.C.Budge,T.Bernardin,J.A.Stuart,S.Sengupta,K.Joy,and
J.D.Owens,“Out-of-core Data Management for Path Tracing on
Hybrid Resources,” in Proc.EUROGRAPHICS ’09,Mar.2009.
[19] J.Hanika,A.Keller,and H.P.A.Lensch,“Two-level ray tracing
with reordering for highly complex scenes,” in Proc.of GI ’10,ser.
GI ’10,2010,pp.145–152.