TECHNICAL REPORT,SEPTEMBER 2011 1
Recursive MIS Computation for Streaming
BDPT on the GPU
Dietger van Antwerpen
Delft University of Technology
Abstract—Bidirectional Path Tracing (BDPT) is a robust unbiased rendering algorithm that samples paths by connecting eye and light
paths.By optimally combining different sampling strategies using Multiple Importance Sampling (MIS),BDPT efﬁciently renders scenes
with complex light effects.However,BDPT does not map well on a streaming architecture such as the GPU;Stochastic path lengths
lead to an uneven workload between BDPT samples and a large memory footprint that depends on the maximumallowed path length.
In this paper,we present a streaming alternative to BDPT,called SBDPT.We present a novel scheme for recursively computing balance
heuristic MIS weights.This scheme allows for bidirectional connections to be evaluated independently,only requiring data fromthe light
and eye vertex directly involved in the connection.Using this property,we modify the BDPT algorithm to generate a new eye path for
each light vertex.The resulting SBDPT algorithm only requires temporary storage for the last vertex on both eye and light paths.Our
implementation achieves high performance and has a ﬁxed low memory footprint,independent of the maximum path length.
1 INTRODUCTION
Since the introduction of GeneralPurpose GPU comput
ing,there has been a signiﬁcant increase in ray tracing
performance on the GPU.While CPU’s perform reason
ably well on coherent rays with similar origin and direc
tion,on current hardware the GPU vastly outperforms
the CPU when it comes to incoherent rays,required for
unbiased rendering algorithms [1].A number of path
tracers have been implemented on the GPU,resulting
in interactive unbiased renderers [2],[3].Interactive
unbiased renderers can have signiﬁcant value,rapidly
providing artists and designers with accurate feedback
before starting the often timely computation of the ﬁnal
highquality images [4].
In practice,path tracing suffers from large variance
when rendering indoor scenes with strong indirect light
ing effects or highfrequency effects such as caustics.
Bidirectional pathtracing (BDPT) [5],[6] has the same
qualities as path tracing,but is much more robust with
respect to these complex light effects.BDPT achieves
much lower variance by optimally combining different
sampling strategies using multiple importance sampling
(MIS).Already,some effort has been made to implement
BDPT on the GPU.These solutions either use the GPU
as a coprocessor [4],or restrict the BDPT algorithm by
limiting its maximum path length and using suboptimal
combination weights [2].
In this paper,we present a streaming BDPT implemen
tation that runs entirely on the GPU.Our algorithmcom
putes the balance heuristic weights using a novel compu
tation scheme.This scheme allows for a more streaming
formulation of the BDPT algorithm.Consequently,our
implementation does not limit the maximumpath length
and has a ﬁxed,low memory footprint,independent of
the actual path lengths.
Fig.1:Test scenes:Kitchen,Sponza,Bathroom,and
Dragon.
2 RELATED WORK
Since the introduction of GPGPU programming,there
has been a great deal of research on GPU ray tracing.
Purcell et al.[3] were the ﬁrst to publish an efﬁcient
GPU path tracer.Since then,many researchers have
attempted to improve GPU ray tracing.Most of this
work has been focused on efﬁcient ray traversal on the
GPU [1],[7].Particularly relevant to our discussion is
the work of Nov´ak et al.[2],who studied the decrease
in SIMT efﬁciency due to the stochastic termination of
TECHNICAL REPORT,SEPTEMBER 2011 2
paths during path tracing.They proposed to keep the av
erage SIMT efﬁciency high by immediately regenerating
terminated paths.On current hardware,GPU ray tracing
outperforms CPU ray tracing by an order of magnitude.
Recently,some attempts have been made at mapping
the BDPT algorithm on the GPU.Nov´ak et al.[2] pre
sented the ﬁrst BDPT implementation running entirely
on the GPU.First,a large collection of independent light
paths is generated in parallel and stored in GPU mem
ory.Then,during eye path tracing,each eye vertex is
connected to all vertices on one of these light paths.For
each light path,enough memory is allocated in advance
to store a path of some predetermined maximumlength.
Consequently,this method’s memory footprint is propor
tional to the maximum path length,resulting in either a
large memory footprint or a low maximum path length.
They did not address MIS but simply weighted all
paths equally.Furthermore,their implementation uses
a suboptimal importance sampling scheme of sharing
Russian roulette decisions between all threads in a GPU
warp to keep GPU efﬁciency high.
Pajot et al.[4] presented a hybrid CBPT implementa
tion,balancing work between CPU and GPU.First,N
E
eye paths and N
L
light paths are generated on the CPU.
Then,all vertices of all eye and light paths are copied to
the GPU.On the GPU,all eye vertices are connected to
all light vertices of all N
L
samples.The results are copied
back to the GPU and accumulated on the image plane.
All connections are evaluated in parallel,leading to high
GPU efﬁciency.Pajot et al.proposed to balance the CPU
and GPU workload by carefully choosing N
E
and N
L
.
Because all eye and light paths are generated on the
CPU,for large N
E
and N
L
the required GPU memory is
roughly proportional to the average path length,instead
of the maximumpath length.Therefore,there is no need
for a (tight) maximum path length.
Since the introduction of BDPT,the ﬁeld of unbi
ased physically based rendering has made signiﬁcant
advances.The application of Markov Chain Monte Carlo
methods has led to more robust rendering methods
for very difﬁcult scenes [8],[9].Unfortunately,these
methods are sequential in nature and depend heavily
on the initial state.Consequently,they do not provide
valuable feedback as rapidly as standard Monte Carlo
methods for all but the most difﬁcult scenes.In this
paper,we will therefore ignore these methods and focus
on BDPT.
In section 7,we will compare our work with a
straightforward BDPT implementation,along the lines
of a parallel CPU implementation,and a hybrid CBPT
implementation as described by Pajot et al.[4].We
will show that our method has a signiﬁcantly smaller
memory footprint while achieving faster convergence.
3 BDPT ON GPU
Unbiased Monte Carlo rendering algorithms such as
BDPT are often viewed as being embarrassingly parallel
[10].Because these algorithms require the evaluation
of many independent samples,extracting parallelism
seems as trivial as constructing many independent BDPT
samples in parallel,evaluating one sample per thread.
However,the large memory consumption of such
a trivial implementation is problematic.Each thread
requires temporary memory to store its path vertices.
During connection,these vertices are required to estab
lish the connections and again,for each connection,to
compute the MIS weight.Enough memory is usually
allocated per thread beforehand to reach some predeter
mined maximumpath length.Restricting the path length
introduces some bias in the ﬁnal estimate.To keep bias
limited,a signiﬁcantly large maximum path length is
chosen.Consequently,running many threads in parallel
results in a large memory footprint.This is problematic
because modern GPUs have limited memory.This video
memory is also needed to store the scene data.If the
scene does not ﬁt entirely in video memory,out of core
rendering techniques are required,considerably decreas
ing overall performance.It is therefore important to keep
the method’s memory footprint low,without introducing
(too much) bias in the estimate.
Furthermore,as pointed out by Nov´ak et al.[2],the
stochastic path lengths of BDPT samples reduce SIMT
efﬁciency.On GPUs with CUDAarchitecture,threads are
processed simultaneously in groups of 32,called warps;
all executing in SIMT.To achieve maximumﬁnegrained
parallelism a GPU requires coherence in both code exe
cution and memory access between threads in a warp.
The stochastic paths lengths lead to an uneven workload
and incoherent execution between BDPT samples,which
reduce GPU efﬁciency.
This paper presents a recursive scheme for computing
the power heuristic weights,allowing independent eval
uations of bidirectional connections,only requiring data
from the light and eye vertex directly involved in the
connection.Using this property we modify the BDPT
algorithm into the more streaming SBDPT algorithm
by generating a new eye path for each vertex on the
light path.The algorithm ensures high GPU utilization
through path regeneration,in spite of the stochastic path
termination.Our CUDAimplementation requires merely
storage for a single light and eye vertex in memory
at any time during sample evaluation.This causes the
implementation’s memory footprint to be independent
of the maximum path length,keeping memory con
sumption low without introducing bias.
4 BACKGROUND
BDPT is a Monte Carlo based rendering algorithm.To
make an image,BDPT samples random light transport
paths from the light source to the eye.A light transport
path X = x
0
; ;x
k
is a sequence of points with x
0
on the lens,x
k
on the light source and x
1
; ;x
k1
on
the scene surface.The total power reaching the eye is
found by estimating the integral of radiance L(X) over
TECHNICAL REPORT,SEPTEMBER 2011 3
the space
of all paths
I =
Z
L(X) d
(X):(1)
The radiance L(x
0
; ;x
k
) equals
L(x
0
; ;x
k
) =W(x
1
!x
0
)
k1
Y
j=0
G(x
j
$x
j+1
)
k1
Y
j=1
f
r
(x
j+1
!x
j
!x
j1
) L
e
(x
k
!x
k1
):
In this equation W(x
1
!x
0
) is the sensor sensitivity
function,L
e
(x
k
!x
k1
) is the light source emittance at
x
k
and f
r
(x
j+1
!x
j
!x
j1
) is the BSDF at x
j
.Finally,
G(x
i
$x
i+1
) is the geometric term to convert from unit
projected solid angle to unit surface area.
G(x
i
$x
i+1
) = V (x
i
$x
i+1
)
cos (
i
) cos
0
i+1
jjx
i+1
x
i
jj
2
:(2)
In this factor V (x
i
$x
i+1
) is the visibility term,which
is 1 if the two surface points are visible fromone another
and 0 otherwise.
i
and
0
i+1
are the angles between
the local surface normals and the path segment between
vertices x
i
and x
i+1
.
BDPT samples collections of correlated light transport
paths [5],[6].It samples an eye path and a light path us
ing random walks and connects these to form complete
light transport paths.The eye path starts at the eye and
is traced backwards into the scene.The light path starts
at a light source and is traced forward into the scene.
The random walks are randomly terminated at each
bounce using Russian roulette.Explicitly connecting the
endpoints of any eye and light path using a shadow
ray results in a complete light transport path from light
source to eye.For efﬁciency,instead of generating a
single path per sample,BDPT usually generates a collec
tion of correlated paths per sample.The most common
approach is to construct paths by connecting all vertices
on one eye path to all vertices on one light path.Figure
2 shows such a BDPT sample.Pajot et al.[4] proposed
a more general approach called Combinatorial Bidirec
tional Path Tracing (CBPT).A sample in CBPT consists
of N
E
eye paths and N
L
light paths.Light transport
paths are constructed by connecting all vertices on all
N
E
eye paths to all vertices on the N
L
light paths.For
N
E
= N
L
= 1,CBPT is equivalent to BDPT.Increasing
N
E
and N
L
results in larger collections of correlated
paths per sample.
Let X 2
be a light transport path of length k with
vertices x
0
; ;x
k
,that is part of a BDPT sample.This
path can be constructed using k+1 different bidirectional
sampling strategies by connecting an eye path x
0
; ;x
s
of length s 0 with a light path x
s+1
; ;x
k
of length
t 0,with k = s + t.Note that both the eye and
light path may be of length zero.In case of a zero
length light path,the eye path ends at a light source.
This strategy is called implicit.All remaining sampling
Fig.2:A bidirectional sample
strategies are explicit because they require an explicit
connection.In case of an eye path of length 0,the light
path is directly connected to the eye.These paths are
called caustic paths.For simplicity we ignore light paths
that directly hit a ﬁnite aperture lens.p
s
(X) is deﬁned
as the probability of generating X by connecting an eye
path of length s with a light path of length t = k s.
During a random walk the next path vertex x
i+1
is
sampled by generating a random outgoing direction
x
i
!x
i+1
based on the incoming direction x
i1
!x
i
.
The probability P (x
i1
!x
i
!x
i+1
) of sampling the
outgoing direction is usually given per unit projected
solid angle.The geometric factor gives the relation be
tween projected solid angle and unit area.We assume
that the inverse termination probability due to Russian
roulette is incorporated in this probability.The probabil
ity of sampling x
i+1
from x
i
per unit area as part of the
eye path is thus given by
!
p
i
= P (x
i1
!x
i
!x
i+1
) G(x
i
$x
i+1
):(3)
Similarly,the probability of sampling x
i1
from x
i
per
unit area as part of the light path is given by
p
i
= P (x
i1
x
i
x
i+1
) G(x
i1
$x
i
):(4)
As proposed by Veach and Guibas [5],we deﬁne two
special cases to keep the above notation clean:
P (x
1
!x
0
!x
1
) is the probability of sampling
the outgoing direction x
0
!x
1
and depends on the
lens model.
P (x
k1
x
k
x
k+1
) is the probability of sam
pling the outgoing direction x
k1
x
k
from the
light source and depends on the emission model.
Note that vertex x
1
on an eye path is usually con
structed by sampling a random point in screen space.
Consequently,in order to compute
!
p
0
the sampling
probability w.r.t.unit screen space must be converted
to the corresponding probability w.r.t.unit surface area
[11].
What remains are the boundary cases of sampling the
ﬁrst eye and light vertex.The eye vertex is sampled
directly on the lens and the light vertex is sampled
directly on the light source.The corresponding sampling
probabilities are usually already expressed per unit area,
so no extra conversions are required:
!
p
1
=P (x
0
)
p
k+1
=P (x
k
):
(5)
TECHNICAL REPORT,SEPTEMBER 2011 4
Using the above,the bidirectional sampling probability
p
s
(X = x
0
; ;x
k
) of sampling X by connecting an eye
path of length s with a light path of length t = k s
equals
p
s
(x
0
; ;x
k
) =
s1
Y
i=1
!
p
i
k+1
Y
i=s+2
p
i
:(6)
BDPT samples a collection of paths X
1
; ;X
N
,where
each path is sampled using some bidirectional sampling
strategy.These paths are then combined into a single
Monte Carlo estimate:
I
N
X
i=1
w(X
i
)
L(X
i
)
p(X
i
)
:(7)
Because any path may be sampled through one of mul
tiple bidirectional sampling strategies,their contribution
must be properly weighted.w
s
(X) is the weight for a
path X of length k that is constructed by connecting an
eye path of length s and a light path of length t = k s.
There are various possible combination strategies that
result in an unbiased estimator.Veach and Guibas [5]
proposed the power heuristic
w
s
(X) =
p
s
(X)
P
k
i=0
N
i;k
p
i
(X)
;(8)
where can be chosen freely with = 1 being called the
balance heuristic and !1 the maximum heuristic.
N
i;k
is the upper bound for the number of paths per
sampling strategy that can occur in a BDPT sample.Note
that due to Russian roulette,most of these paths are
not actually generated.In ordinary BDPT,each strategy
occurs at most once,so N
i;k
= 1.However,because
CBPT combines multiple eye and light paths,a CBPT
sample may contain N
E
implicit paths,N
L
caustic paths,
and N
E
N
L
paths for each of the other explicit con
nection strategies.Hence,N
0;k
= N
L
;N
k;k
= N
E
and
N
0<i<k;k
= N
E
N
L
.Veach and Guibas [5] showed that the
power heuristic is the best possible combination in the
absence of further information.In particular,he proved
that no other combination strategy can be a signiﬁcant
improvement over the balance heuristic.
During BDPT sample construction,a MIS weight
needs to be computed for each connected eye and light
vertex.Veach and Guibas [5] proposed a method for
efﬁciently constructing balance heuristic weights.This
method is based on similarities in the bidirectional sam
pling probabilities p
i
(X) and p
i+1
(X).The correspond
ing strategies sample the ﬁrst x
0
; ;x
i
vertices as part
of the eye path and at least the last vertices x
i+2
; ;x
k
as part of the light paths.Hence,they only differ in the
way vertex x
i+1
is sampled.Thus,the ratio of p
i+1
(X)
and p
i
(X) equals
p
i+1
(X)
p
i
(X)
=
!
p
i
p
i+2
:(9)
This equation can be applied repeatedly,starting with
p
s
(X) to ﬁnd the other bidirectional sampling proba
bilities p
s+1
(X); ;p
k
(X).Similarly,the reciprocal ra
tio p
i
(X)=p
i+1
(X) is used to compute the probabilities
p
s1
(X); ;p
0
(X).Once all p
i
have been computed,the
computation of the balance heuristic,power heuristic or
maximum heuristic is fairly straightforward.
5 RECURSIVE MIS
Although Veach and Guibas’s [5] computation scheme
for balance heuristic weights is very generic and fairly
efﬁcient in terms of computations,the number of op
erations to construct each weight depends on the path
length,requiring data from all vertices on the path.
This prevents efﬁcient parallelization of both compu
tations and memory access on the GPU.We propose
a different scheme for computing the power heuristic
weights.During the construction of the eye and light
paths,we recursively compute one extra quantity in each
path vertex.Using this quantity,the weights for each
connection can be constructed locally,requiring only
data from the eye and light vertex directly involved in
the connection.
To do this,instead of directly computing the weight
for a connection,it is both easier and more numerically
stable to compute the inverse weight
1
w
s
(X)
=
k
X
i=0
N
i;k
p
i
(X)
p
s
(X)
:(10)
Based on common factors in the terms,this equation can
be split into three parts.
1
w
s
(X)
=
s1
X
i=0
N
i;k
p
i
(X)
p
s
(X)
+N
s;k
+
k
X
i=s+1
N
i;k
p
i
(X)
p
s
(X)
:(11)
The ﬁrst term in equation 11 represents all sampling
strategies with less than s eye vertices.These strategies
all sample at least the vertices x
k
; ;x
s
as part of
the light path.Hence,the probability of sampling these
vertices is a common factor in both numerator and
denominator and cancel out.Furthermore,all forward
sampling probabilities appearing in the numerator also
appear in the denominator and thus cancel out as well,
resulting in
s1
X
i=0
N
i;k
p
i
(X)
p
s
(X)
=
s1
X
i=0
N
i;k
Q
s+1
j=i+2
p
j
Q
s1
j=i
!
p
j
=
p
s+1
s1
X
i=0
0
@
N
i;k
!
p
i
s1
Y
j=i+1
p
j+1
!
p
j
1
A
:
By expanding the last termin this summation,the recur
TECHNICAL REPORT,SEPTEMBER 2011 5
sion appears:
s1
X
i=0
N
i;k
p
i
(X)
p
s
(X)
=
p
s+1
0
@
N
s1;k
!
p
s1
+
p
s
!
p
s1
s2
X
i=0
0
@
N
i;k
!
p
i
s2
Y
j=i+1
p
j+1
!
p
j
1
A
1
A
:
We deﬁne the quantity d
E
s
as
d
E
1
=
N
0;k
!
p
0
d
E
s
=
s1
X
i=0
N
i;k
!
p
i
s1
Y
j=i+1
p
j+1
!
p
j
=
N
s1;k
+
p
s
d
E
s1
!
p
s1
:
(12)
During the construction of the eye path,this quantity
is recursively computed for each eye vertex.Using this
quantity,the ﬁrst term in equation 11 reduces to
s1
X
i=0
N
i;k
p
i
(X)
p
s
(X)
= d
E
s
p
s+1
:(13)
Hence,by using the precomputed quantity d
E
s
when
making a connection,the ﬁrst term can be computed
locally,requiring only data from the eye vertex involved
in the connection and a ﬁxed amount of operations,
independent of the path lengths.
The second term in equation 11 corresponds to the
strategy with exactly s eye vertices and is the strategy
actually used to sample this path.Because p
s
(X) ap
pears in both the numerator and denominator,this term
reduces to N
s;k
.
Finally,the last term in equation 11 is similar to the
ﬁrst term and represents all sampling strategies with
more than s eye vertices.After eliminating all common
factors,this sum becomes
k
X
i=s+1
N
i;k
p
i
(X)
p
s
(X)
=
!
p
s
k
X
i=s+1
0
@
N
i;k
p
i+1
i1
Y
j=s+1
!
p
j
p
j+1
1
A
:
The corresponding quantity d
L
s+1
is deﬁned as
d
L
k
=
N
k;k
p
k+1
d
L
s+1
=
k
X
i=s+1
N
i;k
p
i+1
i1
Y
j=s+1
!
p
j
p
j+1
=
N
s+1;k
+
!
p
s+1
d
L
s+2
p
s+2
:
(14)
During the construction of the light path this quantity is
recursively computed for each light vertex.Hence,the
last term in equation 11 reduces to
k
X
i=s+1
N
i;k
p
i
(X)
p
s
(X)
= d
L
s+1
!
p
s
:(15)
By substituting equations 13 and 15 into equation 11,the
equation reduces to
1
w
s
(X)
=
8
<
:
N
0;k
+d
L
1
!
p
0
if s = 0
d
E
k
p
k+1
+N
k;k
if s = k
d
E
s
p
s+1
+N
s;k
+d
L
s+1
!
p
s
other:
(16)
Note that the value
p
s
in the quantity d
E
s
depends on
both the eye and light vertex involved in the connection.
Because the light vertex is unknown during the construc
tion of the eye path,d
E
s
can only be computed fromd
E
s1
when the actual connection is made.The same holds for
d
L
s+1
.
5.1 Local sampling
During eye path construction we must be able to eval
uate the reverse sampling probability
p
i
.However,at
the time of evaluation no information about the light
path is available.This evaluation must therefore be
independent of the light path.Because of this restriction,
any sampling decisions at a vertex may only depend on
local information such as the incoming direction and the
surface properties,and not on global path information
such as the vertex’s index on the light path.Note that
the Russian roulette termination probability is part of the
sampling probability.Therefore,this termination proba
bility may also only depend on local vertex information.
Obviously,the local sampling restriction also applies to
light path construction.
A similar restriction holds for N
i;k
.Because k is not
yet known during eye path construction,N
i1;k
must
be directly computable from i,without knowledge of
k.Similarly,in order to evaluate d
L
i
during light path
construction,N
i;k
must be directly computable from
t = k i as well.Both BDPT and CBPT satisfy these
restrictions.
The ﬁrst diffuse vertex x
i
on the eye path forms
an exception to the local sampling restriction.Because
connecting to/from a specular vertex results in zero
probability,all sampling strategies sampling any of the
preceding vertices x
1
; ;x
i1
as part of the light path
result in zero probability.Therefore,the ﬁrst diffuse
vertex on a light transport path is either generated as
part of the eye path or by connecting a light path vertex
directly to the eye,corresponding to i = 1.Consequently,
the sample probability P(x
i1
!x
i
!x
i+1
) is only
evaluated during eye path construction and connection,
but never during light path construction.For this reason,
this ﬁrst diffuse vertex poses an exception to the local
sampling restriction.During both eye path construction
and connection evaluation,the index i is known,so the
termination decision may depend on i.This is useful,
because we usually like to force at least one diffuse
bounce on a path.Note that as the ﬁrst light vertex is
a diffuse vertex,the light path can be forced to have at
least two diffuse vertices through similar reasoning.
TECHNICAL REPORT,SEPTEMBER 2011 6
Fig.3:Construction of an SBDPT sample.1) Light vertex is constructed 2) Eye path is generated and connected to
light 3) Light path is extended 4) Light path is connected to eye 5) Eye path is generated and connected to light
path 6) Light path is terminated.
5.2 Details
We claimed that only data from a single vertex on
both eye and light path is required to compute the
balance heuristic weights.However,due to the geometric
factor in
p
i
and
!
p
i
,two path vertices seem required
to compute
p
i
and
!
p
i
.We overcome this problem by
carefully splitting up these computations.For example,
the computation of
p
i
can be split into
p
0
i
=jcos (
i1
)j
p
i
=
p
0
i
P (x
i1
x
i
x
i+1
)
jcos (
0
i
)j
jjx
i
x
i1
jj
2
:
(17)
During eye path construction,
p
0
i
is computed in vertex
x
i1
and
p
i
is computed from
p
0
i
in vertex x
i
.This
way,only data from the last path vertex is required at
any point during the construction of the weights.The
computation of
!
p
i
during light path construction can
be split up in a similar way.
Remember that the balance heuristic is not the only
valid MIS weight scheme.A recursive computation
scheme for the more generic power heuristic can easily
be obtained by raising all probabilities
p and
!
p to the
power of .Similarly,the maximum heuristic can be
obtained from the balance heuristic by substituting all
addition operators by maximum operators.
5.3 Specular bounce
Thus far we assumed that we could numerically evaluate
!
p
i
and
p
i
for all i.However,this is only true for
diffuse materials.Specular materials are modeled by
Dirac delta functions [12].Therefore,sample probabil
ities P (x
i1
!x
i
!x
i+1
) and P (x
i1
x
i
x
i+1
) at
a specular vertex x
i
also contain Dirac delta functions.
Consequently,
!
p
i
and
!
p
i
cannot be evaluated numer
ically.Special care must be taken to handle specular
vertices correctly while computing MIS weight.
Note that in BDPT,specular vertices are never used
in an explicit connection to form a complete light path.
For such paths the Monte Carlo contribution would
equal f(X)=p
s
(X) = 0.Therefore,specular vertices are
skipped when evaluating connections.Hence,for some
light transport path X sampled with p
s
(X),vertices s
and s+1 are always diffuse.In turn,when either vertex
i or i +1 is specular,the probability p
i
(X) = 0.For this
reason,d
E
i
is slightly modiﬁed when specular vertices
are involved.Whenever either vertex i or i+1 is specular,
this path could not have been sampled by connecting
vertex i and i +1 and thus d
E
i
reduces to
d
E
i
=
p
i
d
E
i1
!
p
i1
:
(18)
The same holds for d
L
i
on the light path.
6 STREAMING BDPT
In the original BDPT algorithm,a single eye path and
light path are generated per sample and connected to
form a collection of light transport paths.In section 3
we explained why this method is not well suited for
an efﬁcient GPU implementation.We propose therefore
to alter the BDPT algorithm to make it better ﬁt the
streaming architecture of a GPU.We call this algorithm
Streaming BDPT or SBDPT.Instead of generating a
single eye path per sample,we generate a new eye path
for each vertex on the light path.Figure 3 shows how an
SBDPT sample is constructed.We start out with a light
path containing a single vertex on some light source (1).
After this vertex is generated,the ﬁrst eye path is gener
ated (2).During construction of the ﬁrst eye path,each
vertex is directly connected to the light vertex resulting
in complete light transport paths.When the eye path
terminates,the light path is extended by a single vertex
(3).This vertex is directly connected to the eye ﬁrst (4),
before a new eye path is generated (5).The vertices on
this new eye path now connect to the new endpoint of
the extended light path.This process is repeated until
the light path is terminated (6) and the SBDPT sample is
complete.Using this method,connections are no longer
made after either of the paths is complete,but instead
during the construction of the light and eye paths.At
any time during this construction only the endpoints of
the paths are connected.The computation scheme from
last section makes it possible to compute the balance
heuristic weights for each connection using only data
from the connected endpoints.Note that sampling the
next path vertex on either the eye or light vertex also
requires just the data from the current endpoint of the
path.Hence,we only need to store the endpoints;there
is no need to store the whole path.This results in a
ﬁxed memory footprint per sample,independent of path
length.
6.1 Implicit paths
The probability of generating a light transport path as
part of a BDPT sample remains the same for SBDPT,
TECHNICAL REPORT,SEPTEMBER 2011 7
0.25
0.255
0.26
0.265
0.27
0.275
0.28
0.285
0.29
0.295
0.3
5
10
15
20
25
30
RMSE
n_l
SBDPT RMSE
BDPT RMSE
Fig.4:Sensitivity of RMS error to ~n
l
approximation in
dragon scene after 1000 samples per pixel,with
n
l
=
2:894.
except for implicit paths.Instead of generating only a
single eye path per sample,SBDPT generates on average
n
l
eye paths per sample,with
n
l
being the expected num
ber of nonspecular vertices on a light path.Hence,the
expected number of implicit paths per sample increases
by a factor of
n
l
.We should therefore use N
k;k
=
n
l
in
the balance heuristic weights.
Unfortunately,
n
l
is usually not known in advance.
Instead,we used an estimate ~n
l
n
l
.Note that with
out taking special measures,the error in this estimate
would result in bias in the ﬁnal image.In particular,
the expected contribution of implicit paths would be off
by a factor of ~n
l
=
n
l
.To counter this bias,we record the
contributions of all implicit paths in a separate image,
called the implicit image.Afterwards,we reduce the
contribution of the implicit image by ~n
l
=
n
l
and add it
to the ﬁnal image,resulting in an unbiased estimate.
Although this method keeps the ﬁnal estimate unbiased,
using N
k;k
= ~n
l
instead of
n
l
in the balance heuristic
results in theoretically slightly suboptimal MIS weights,
possibly increasing the variance somewhat.
From our test scenes the Dragon scene turned out to
be the most sensitive to the choice of ~n
l
.This can be at
tributed to the presence of many highly glossy surfaces.
Consequently,this scene contains many effects for which
explicit and implicit sampling strategies have similar
sampling densities.Figure 4 quantiﬁes the numerical
error for the Dragon scene as a function of ~n
l
after 1000
samples per pixel.The graph shows that the algorithmis
not very sensitive to the choice of ~n
l
.Therefore,a rough
estimate of ~n
l
sufﬁces.The mean and variance of n
l
in
the test scenes is shown in table 2,indicating that a small
set of about 10
4
randomlight paths should be more than
enough to adequately estimate
n
l
in advance.
An alternative solution would be to either only ac
count for implicit paths on the ﬁrst eye path,or by
generating multiple light paths.Instead of generating a
new eye path for each vertex on the light path,we could
generate a newlight path for each vertex on the eye path.
This way,only a single eye path is created per sample
and thus N
k;k
= 1.Multiple light paths will be generated
per sample but as we do not consider the contribution
of implicit light paths (light paths hitting the lens) we
do not need to estimate
n
e
.For both alternatives the
probability of generating any path as part of an SBDPT
sample would be identical to BDPT.
Note however that by generating multiple eye paths
we only increase the probability of sampling implicit
paths,while keeping all other probabilities equal.In a
sense,we get the extra implicit paths for free.Hence,as
long as we do not overestimate the increase in probabil
ity ( ~n
l
<
n
l
),we can only improve over BDPT.As we
force at least two diffuse vertices on each light path,we
can safely assume that
n
l
> 2 for indoor scenes.In the
remainder of this paper we use the ﬁxed conservative
estimate of ~n
l
= 2,mainly because it eliminates the (be
it small) preprocessing phase of estimating
n
l
and thus
speeds up interactive feedback,which is an important
feature for practical renderers [4].
6.2 Implementation
In this section,we discuss the details of our SBDPT im
plementation in CUDA.To construct an SBDPT sample,
the implementation repeatedly extends the sample with
one vertex on either eye or light path (See Figure 3).After
each extension,a connection between the endpoints of
the eye and light path is evaluated.Whenever the eye
path is terminated,is is immediately regenerated in
place.This process of repeatedly extending and connect
ing is continued until the light path is terminated,after
which the sample is complete.
Fig.5:All samples in SBDPT stream repeatedly extend
either their eye or light paths and connect the endpoints
of both paths.Just before the next extension,any termi
nated paths are regenerated.
The implementation constructs many independent SB
DPT samples in parallel.During construction,SBDPT
operates on a long,ﬁxed size stream of samples (see
Figure 5).A separate CUDA thread is executed for
each sample in the stream.The implementation applies
the extension and connection operations repeatedly to
TECHNICAL REPORT,SEPTEMBER 2011 8
all samples in the stream.When all samples are ex
tended and connected,some of the samples may have
terminated.We use sample regeneration as proposed
by Nov´ak et al.[2] to keep the sample stream ﬁlled
and SIMT efﬁciency high.Before again extending all
samples,any terminated samples are regenerated in
place.We used separate kernels for the extend,connect,
generate,extend ray traversal and connect ray traversal
operations.
Because the lengths of eye and light paths vary be
tween threads in a warp,some threads are handling their
light path while at the same time all other threads are
handling their eye paths.To keep SIMT efﬁciency high,
code paths for handling the extension of light and eye
path are shared.These code paths are very similar,apart
from minor variations in sampling the ﬁrst path vertex
and the evaluation of adjoint BSDF’s.
To reduce visible correlation between eye paths in
an SBDPT sample,pixels are sampled using a stratiﬁed
pattern over the image plane.As a consequence,different
samples may contribute to the same pixel.Furthermore,
caustic connections may contribute to any pixel on the
image plane.Therefore,we used atomic instructions to
deposit contributions on the image plane.
We used the Brigade path tracing engine as a starting
point for our implementations [13].The Brigade engine
uses a BVH as acceleration structure and a ray traversal
kernel which is similar to the one described by Aila and
Laine [1].For random number generation,we used the
parallel Mersenne Twister provided with the NVIDIA
CUDA SDK [14].
7 RESULTS
We compared our SBDPT implementation with a GPU
path tracer,a straightforward GPU implementation of
BDPT and a hybrid CBPT implementation as described
by Pajot et al.[4].We did not make a direct comparison
with Nov´ak et al.[2] because their method lacks a
detailed description and seems to closely resemble a
straightforward BDPT.All experiments were performed
on an Intel
R
Core
TM
Duo Quad Q6600 CPU with an sin
gle NVIDIA
R
Geforce
R
GTX 480 GPU.The implemen
tations use CUDA for GPU programming.Our BDPT
implementation works along the lines of a traditional
parallel CPU implementation [15].First,the algorithm
constructs an eye and light path for all samples in
parallel and stores them in global GPU memory.Then,a
separate GPU thread is executed for each BDPT sample,
sequentially evaluating all connections between its eye
and light path.We used a maximum path length of 12,
which is usually enough for scenes with few transpar
ent objects.Note that scenes with participating media
often require longer paths,further stressing memory
resources.The implementation does not use recursive
MIS,but computes the MIS weights using the traditional
approach proposed by Veach and Guibas [5].
To fully occupy all GPU resources and allow for
effective load balancing,the implementations have to
0.01
0.1
1
0
10
20
30
40
50
60
Perceptual error
Time (s)
Sponza
sbdpt
bdpt
cbpt
pt
0.01
0.1
1
0
500
1000
1500
2000
2500
3000
Perceptual error
Time (s)
Dragon
sbdpt
bdpt
cbpt
pt
0.01
0.1
1
0
200
400
600
800
1000
1200
1400
1600
1800
2000
Perceptual error
Time (s)
Bathroom
sbdpt
bdpt
cbpt
pt
0.01
0.1
1
0
100
200
300
400
500
Perceptual error
Time (s)
Kitchen
sbdpt
bdpt
cbpt
pt
Fig.7:Numerical convergence comparison of BDPT,
SBDPT,CBPT and PT over time using perceptual error
measure.
TECHNICAL REPORT,SEPTEMBER 2011 9
(a) PT
(b) BDPT
(c) CBPT
(d) SBDPT without MIS
(e) SBDPT with MIS
Fig.6:Graphical comparison of the Sponza (left),Dragon (middle left),Bathroom(middle right) and Kitchen (right)
scene,with (a) PT,(b) BDPT,(c) CPT,(d) SBDPT without MIS and (e) SBDPT with MIS.Render times were 10s
for Sponza and 30s for Dragon,Bathroom and Kitchen.
TECHNICAL REPORT,SEPTEMBER 2011 10
execute many threads in parallel.For BDPT and SBDPT,
we found that a stream of 131:072 samples is enough.
The performance gain for larger streams is small (less
then 10%) and does not justify the increase in memory
consumption.
7.1 CBPT
The CBPT implementation uses double buffering and
multithreading to utilize all available computing re
sources in parallel.As suggested by Pajot et al.[4],,we
generate N
E
= N
T
eye and light paths per CBPT sample,
but connect only N
L
N
T
of the light paths with
eye paths on the GPU.The MIS weights are adjusted
accordingly.
We found that selecting appropriate values for N
E
and N
L
is not trivial,because they not only determine
the balance between CPU and GPU work,but also the
memory footprint and the quality of the CBPT Monte
Carlo estimate.As proposed by Pajot et al.[4],we tuned
N
E
and N
L
to fully utilize both CPU and GPU.This
turned out to be quite difﬁcult for our test platform
because the hybrid CBPT implementation does not scale
very well with GPU resources.In hybrid CBPT,the
GPU only computes local probabilities for each connec
tion.These probabilities need to be copied back to the
CPU and combined to compute the MIS weights for
each connection.Thus,the memory bandwidth,memory
footprint and both the CPU and GPU workload are all
proportional to N
E
N
L
.Consequently,hybrid CBPT
does not scale well with GPU resources.Because we
used a powerful GPU in combination with a somewhat
outdated CPU,GPUmemory and CPUresources quickly
became the bottleneck and prohibited full utilization of
the GPU.The same problems are likely to show when
using a highend CPU in combination with multiple
GPUs.
Using our recursive MIS scheme solved this shortcom
ing of CBPT.Because recursive MIS only requires local
data,the GPU can nowalso construct the MIS weight for
each connection.This reduced the CPU workload,mem
ory transfer bandwidth and memory usage,allowing full
utilization of the GPUon our test platform.Although the
optimal parameters for CBPT depend somewhat on the
particular scene,we found that values of N
E
= N
T
=
2000 and N
L
= 60 were in general a good choice for our
test platform.
7.2 Performance
Kelemen et al.[9] noted that the RMS error is not really
good at expressing image differences and proposed a
simple perceptual error measure to quantify perceptual
convergence.The perceptual error measure is deﬁned
as the ratio of pixels whose relative error exceeds a
given threshold.The images in Figure 1 were progres
sively reﬁned to 99% convergence,which is enough for
most practical uses.The remaining noise can easily be
removed using simple post processing ﬁlters.In Figure
7 we compare convergence over time for the different
algorithms using the perceptual error measure.Note that
for technical reasons,black pixels were not taken into
acount,resluting in the brief initial increase in error
for the PT algorithm.To also give an indication of
the algorithm’s interactivity,Figure 6 visually compares
image quality after a relatively short period of time.
As expected,the bidirectional algorithms signiﬁcantly
outperform basic PT for the Dragon and Kitchen scenes
in both the short and long term.Both scenes show a
signiﬁcant amount of indirect light.Furthermore,be
cause lamps are realistically modeled with reﬂectors
and lenses,both scenes show large caustics.Therefore,
these scenes beneﬁt much from the bidirectional sam
pling strategies.The graphs show that SBDPT performs
slightly better than BDPT for these scenes,which can be
attributed to SBDPT’s streaming nature.This is further
conﬁrmed by the raw performance data in table 2.
PT,BDPT and SBDPT perform fairly similar on the
Sponza scene.Initially PT converges faster but the other
algorithms eventually catch up.This can be explained
by looking at Figure 6:Although PT shows less noise
overall,it suffers from a few relatively bright ﬁreﬂies
which will take longer to go away.
The graphs show that SBDPT signiﬁcantly outper
forms BDPT for the bathroom scene.This was to be
expected and is due to the shower area,which is illu
minated through a window while being viewed through
the shower doors.Consequently,most direct light in the
scene can only be captured by implicit paths.As SBDPT
generates multiple eye paths per sample,it achieves
higher efﬁciency than BDPT.More surprisingly,PT is
also outperformed by both SBDPT and BDPT in the
bathroom scene,even though PT generates more eye
paths per second.Because most direct light in the bath
room scene can only be captured by implicit paths,PT
initially shows less visible noise (see Figure 6).However,
in the long run the bidirectional sampling strategies are
necessary to effectively sample all indirect light,making
the bidirectional algorithms more efﬁcient overall.
CBPT seems to perform consistently worse than both
BDPT and SBDPT in both the short and long term.This
is because CBPT generates signiﬁcantly less eye and light
paths per second,as shown in table 2.Even though CBPT
performs similar to SBDPT in raw rays per second,it
spends most of its time evaluating connects.This shows
that choosing N
E
and N
L
so as to fully utilize all avail
able computing resources does not necessarily improve
Monte Carlo efﬁciency,because the quality of the Monte
Carlo estimate depends on N
E
and N
L
.In CBPT,eye
and light paths are generated solely on the CPU,so the
sampling of implicit and caustic paths does not beneﬁt
much from the available GPU resources.Consequently,
the caustics in the Dragon and Kitchen scene and the
reﬂected caustics in the Bathroom scene remain very
noisy.Interestingly,CBPT also performs worse for the
Sponza scene.This shows that trading too much eye and
light path correlation for explicit connection sampling
TECHNICAL REPORT,SEPTEMBER 2011 11
performance can signiﬁcantly reduce overall efﬁciency,
even for relatively simple scenes.
Without our recursive MIS scheme,it would not be
possible to compute MIS weights without storing all
path vertices.The last two rows in Figure 6 show the
importance of MIS.For SBDPT without MIS,all valid
strategies are simply weighted equally,corresponding to
the balance heuristic with !0.Although SBDPT with
out MIS does improve somewhat over PT for the Dragon
and Kitchen scenes,it suffers from high frequency noise
in the formof ﬁreﬂies and is not competitive with BDPT.
The use of MIS removes most of this noise,signiﬁcantly
reducing the variance in the estimate and making SBDPT
a valuable alternative to BDPT.
7.3 Memory
In our implementations,storing a triangle with shading
normals and texture coordinates requires 112 bytes.Fur
thermore,each BVH node requires 64 bytes.Hence,a
medium sized scene of up to several million triangles
will require several hundred MBs of video memory for
geometry alone.Realistic scenes usually also require a
signiﬁcant amount of texture memory.In the absence of
outofcore rendering,this leaves little memory for the
rendering algorithms.
Memory footprint
Per Sample (Byte)
Total (MByte)
PT
152
19
BDPT
2716
339.5
CBPT

189
SBDPT
348
43.5
TABLE 1:Memory footprint.
Table 1 shows the memory consumption of our im
plementations,not including the scene data.The ta
ble shows the memory consumption per sample and
the total memory consumption,including all temporary
buffers.CBPT has no per sample memory consumption,
as it generates only a single large sample at a time.Note
that the memory consumption of CBPT depends on the
choice of N
E
and N
L
.We could select N
E
and N
L
so as
to reduce memory consumption,but this would prevent
the algorithm from fully utilizing the GPU.
The table shows that SBDPT has a signiﬁcantly smaller
memory footprint than BDPT.The memory consumption
of BDPT can be reduced by choosing a lower maximum
path length.However,we risk introducing visible bias
in scenes with shiny and transparent materials.On the
other hand,SBDPT does not require a maximum path
length and has a ﬁxed memory footprint.Like PT,
the memory consumption of SBDPT is relatively low.
On modern GPUs with several hundred MB of video
memory,this leaves enough video memory to store a
moderately large scene.
8 CONCLUSIONS
In this paper we presented a novel scheme for recur
sively computing MIS weights for BDPT.This scheme
allows us to evaluate connections independently,requir
ing only data from the two connected vertices.Using
this property,we developed a streaming alternative to
BDPT,called SBDPT.SBDPT has a much lower memory
footprint,independent of the maximum allowed path
length,and ﬁts the streaming architecture of modern
GPUs better.Similar to BDPT,SBDPT achieves superior
performance compared to regular PT on difﬁcult scenes
with complex lighting effects such as caustics and in
direct light.At the same time,SBDPT performs better
on reﬂected caustics than BDPT because SBDPT spends
more time sampling eye paths.
SBDPT connects each eye vertex to only one light
vertex.We are investigating the possibility of connecting
each eye vertex to the last light vertices on multiple in
dependent light paths in order to improve the efﬁciency
of sampling explicit paths by trading some correlation
for performance.
Like PT and BDPT,SBDPT still has a hard time
rendering reﬂected caustics for very small light sources.
Progressive photon mapping performs much better for
these effects [16].We are further investigating the use
of our MIS computation scheme towards bidirectional
progressive photon mapping [17] on the GPU for opti
mally combining the strengths of both BDPT and photon
mapping.
Finally,we believe that SBDPT can also be particularly
useful for outofcore bidirectional rendering on both
CPU and GPU.In outofcore rendering the scene data
does not ﬁt in memory.Usually outofcore renderers
generate samples in large batches [18],[19].This way
the cost of loading a block of scene data is amortized
over many samples.By increasing the batch size,the
overhead of outofcore rendering can be reduced.At
the same time,the sample state for all samples in a
batch must ﬁt in memory as well.To keep the sample
state small,unbiased outofcore renderers usually only
support path tracing.SBDPTs small ﬁxed size sample
state allows for large batches and thus efﬁcient outof
core bidirectional sampling.We leave outofcore SBDPT
for future study.
ACKNOWLEDGMENTS
Thanks to Erik Jansen,Jacco Bikker and Jan Nov´ak for
their valuable suggestions and comments.Special thanks
to Reinier van Antwerpen for modeling the Dragon
scene.
REFERENCES
[1] T.Aila and S.Laine,“Understanding the Efﬁciency of
Ray Rraversal on GPUs,” in Proc.HPG ’09.New York,
NY,USA:ACM,2009,pp.145–149.[Online].Available:
http://doi.acm.org/10.1145/1572769.1572792
[2] J.Nov´ak,V.Havran,and C.Dachsbacher,“Path Regeneration
for Interactive Path Tracing,” in Proc.EUROGRAPHICS ’10,Short
Papers.Norrkoeping,Sweden:EG,2010.
[3] T.J.Purcell,I.Buck,W.R.Mark,and P.Hanrahan,“Ray Tracing
on Programmable Graphics Hardware,” in Proc.SIGGRAPH
’02.New York,NY,USA:ACM,2002,pp.703–712.[Online].
Available:http://doi.acm.org/10.1145/566570.566640
TECHNICAL REPORT,SEPTEMBER 2011 12
Performance in 10
6
/sec
BDPT
CBPT
SBDPT
n
l
distribution
Eye
Light
Ray
Eye
Light
Ray
Eye
Light
Ray
n
l
2
(n
l
)
SPONZA
2.24
2.24
22.6
0.17
0.17
23.5
4.82
1.95
27.5
2.646
1.283
KITCHEN
1.80
1.80
20.7
0.13
0.13
27.0
4.68
1.75
26.9
2.824
1.828
BATHROOM
1.49
1.49
20.7
0.09
0.09
21.2
3.35
1.40
25.6
2.495
1.033
DRAGON
1.63
1.63
21.5
0.12
0.12
24.7
4.77
1.55
28.2
2.894
1.663
TABLE 2:Measured statistics for test scenes.The ﬁrst 9 columns show the performance of BDPT,CBPT and SBDPT
in 10
6
generated eye paths,light paths and traced rays per second.The last two columns show the mean and
variance of n
l
.
[4] A.Pajot,M.Paulin,L.Barthe,and P.Poulin,“Combinatorial
Bidirectional PathTracing for Efﬁcient Hybrid CPU/GPU Ren
dering,” Computer Graphics Forum,vol.30,pp.315–324,2011.
[5] E.Veach and L.J.Guibas,“Optimally Combining Sampling
Techniques for Monte Carlo Rendering,” in Proc.SIGGRAPH
’95.Addison Wesley,1995,pp.419–428.[Online].Available:
http://doi.acm.org/10.1145/218380.218498
[6] E.P.Lafortune and Y.D.Willems,“BiDirectional Path Tracing,”
in Proc.COMPUGRAPHICS ’93,Alvor,Portugal,1993,pp.145–
153.
[7] S.Popov,J.G¨unther,H.P.Seidel,and P.Slusallek,“Stackless
KDTree Traversal for High Performance GPU Ray Tracing,” In
Computer Graphics Forum,vol.26,pp.415–424,2007.
[8] E.Veach and L.J.Guibas,“Metropolis Light Transport,” in Proc.
SIGGRAPH ’97.New York,NY,USA:Addison Wesley,1997,pp.
65–76.
[9] C.Kelemen,L.SzirmayKalos,G.Antal,and F.Csonka,“ASimple
and Robust Mutation Strategy for the Metropolis Light Transport
Algorithm,” In Computer Graphics Forum,vol.21,pp.531–540,
2002.
[10] R.Marques and L.P.Santos,“Instant Global Illumination on the
GPU using OptiX,” 2010.
[11] D.Cline,“A Practical Introduction to Metropolis Light Trans
port,” Department of Computer Science,Brigham Young Univer
sity,Tech.Rep.,2005.
[12] E.Veach,“Robust Monte Carlo Methods for Light Transport
Simulation,” 1998,adviserGuibas,Leonidas J.
[13] J.Bikker,“Brigade realtime path tracing,” 2011.[Online].
Available:http://igad.nhtv.nl/bikker/
[14] V.Podlozhnyuk,“Parallel Mersenne Twister.” NVIDIA Corpora
tion,Tech.Rep.,2007.
[15] “Luxrender,” 2011.[Online].Available:
http://www.luxrender.net/
[16] T.Hachisuka,S.Ogaki,and H.W.Jensen,“Progressive Photon
Mapping,” in Proc.ACM SIGGRAPH Asia’08.New York,
NY,USA:ACM,2008,pp.130:1–130:8.[Online].Available:
http://doi.acm.org/10.1145/1457515.1409083
[17] J.Vorba,“Bidirectional Photon Mapping,” in Proc.CESCG ’11.
Vini
ˇ
cn
´
e,Slovakia:CESCG,2011.
[18] B.C.Budge,T.Bernardin,J.A.Stuart,S.Sengupta,K.Joy,and
J.D.Owens,“Outofcore Data Management for Path Tracing on
Hybrid Resources,” in Proc.EUROGRAPHICS ’09,Mar.2009.
[19] J.Hanika,A.Keller,and H.P.A.Lensch,“Twolevel ray tracing
with reordering for highly complex scenes,” in Proc.of GI ’10,ser.
GI ’10,2010,pp.145–152.
Comments 0
Log in to post a comment