Real

time image deconvolution on the GPU
James T. Klosowski and Shankar Krishnan
AT&T Labs Research, 180 Park Ave., Florham Park, NJ USA 07932
ABSTRACT
Two

dimensional image
deconvolution is an important and well

studied problem with applications to image deblurring
and restoration. Most of the best deconvolution algorithms use natural image statistics that act as priors to regularize the
problem. Recently, Krishnan and Ferg
us provide a fast deconvolution algorithm that yield
s
results comparable to the
current state of the art. They use a hyper

Laplacian image prior to regularize the problem. The resulting optimization
problem is solved using alternating minimization in conju
nction with a half

quadratic penalty function. In this paper, we
provide an efficient CUDA implementation of their algorithm on the GPU. Our implementation leverages many well

known CUDA optimization techniques, as well as several others that have a signi
ficant impact on this particular
algorithm. We discuss each of these, as well as make a few observations regarding the CUFFT library. Our experiments
were run on an Nvidia GeForce GTX 260. For a single channel image of size 710 x 470, we obtain over 40 f
ps, while on
a larger image of size 1900 x 1266, we get almost 6 fps (without counting disk I/O).
In addition to linear performance,
we believe ours is the first implementation to perform deconvolutions at video rates.
Our running times also
demonstrate
that our GPU implem
entation is
over
27
times faster than the original CPU implementation.
Keywords:
Deconvolution, deblurring, GPU, CUDA, hyper

Laplacian, FFT
1.
INTRODUCTION
Two

dimensional image deconvolution is an
important
and well

studied problem with a
pplications to image deblurring
and restoration. The original problem
formulation
is ill

posed, and requires some form of regularization to make it
tractable
. Numerous deconvolution algorithms exist, varying greatly in their performance and solution. Simpl
e filtering
algorithms are extremely fast, but yield poor results. Most of the best algorithms use natural image statistics that act as
priors to regularize the problem.
More recently, Krishnan and Fergus
[1]
provide a fast
non

blind
deconvolution algorith
m that yield
s
results comparable
to the current state of the art. Krishnan and Fergus use a hyper

Laplacian image prior to regularize the problem. The
resulting optimization problem is solved using alternating minimization in conjunction with a half

quadra
tic penalty
function. While one of the sub

problems uses a standard L
2
minimization solved in closed form in the Fourier domain,
the key novelty in their approach is that the other sub

problem is a non

convex problem that is separable over the pixels.
This
per

pixel step can be solved by polynomial root finding. Since the coefficients of the polynomial depend only on
system parameters and a range of image gradient values, the valid roots can be pre

computed and stored in a lookup
table
(LUT)
. This makes the
ir approach particularly well

suited for a GPU implementation.
There have been some attempts to implement image deconvolution algorithms on the GPU. Recently, Domanski et. al.
[2]
have implemented
the Richardson

Lucy algorithm [
3,4
]
on an Nv
idia GTX 260
GPU using Fourier space
convolutions. They report an 8.5x performance improvement over a CPU implementation.
Our Contributions:
In this paper, we pr
ovide an implementation of the Krishnan

Fergus
algorithm on the GPU
using
the CUDA a
rchitecture.
We have ma
de extensive use of many well

known CUDA optimization techniques, as well as
several others that have a significant impact on our particular algorithm. Our optimizations fall into four categories:
memory transfers between host and device, device memory ac
cess, parallel execution, and instruction throughput. One
of the most valuable lessons learned was to compute as much data on the GPU as possible, rather than using the CPU
and then transfer
r
ing the data. We discuss this and each of the other optimizatio
ns used, as well as a few additional
observations regarding the CUFFT library.
Our experiments were
performed on a Linux workstation
with an Nv
idia GeForce GTX
260.
On a single channel image
of size 710 x 470, we obtain
over 40
f
rames per second (fps)
, wh
ile on a larger image of size 1900 x 1266, we get
almost
6
f
p
s (ignoring
disk I/O). On a wider assortment of images, our implementation processes over 3 megapixels (1
,
048
,
576
pixels) pe
r second. This is a factor of 27
times faster than the CPU implementa
tion by Krishnan and Fergus
[1]
.
2.
KRISHNAN/FERGUS ALGO
RITHM
In this section, we briefly review the algorithm of Krishnan and Fergus [1] for non

blind deconvolution of images. Their
method relies on the fact that the gradients of natural images exhibit a
heavy

tail
ed distribution, and have been e
ffective
in the past for regularizing the (ill

posed) deconvolution problem. These heavy

tailed distributions can be modeled us
ing
a hyper

Laplacian function:
(
)
, where
0.5 ≤
α
≤ 0.8.
We start by form
ulating the deconvolution problem,
assuming the blur kernel is known.
Most of the notation used is borrowed from the original paper.
Let
x
be the
original uncorrupted linear
grayscale image of
N
pixels;
y
be the observed image corrupted with blur and/or
noise, which is assumed to be the result of a convolution with a known blur kernel
k
and
the
addition of zero

mean
Gaussian noise. The problem is to reconstruct
x
given
y
and
k
. Given the ill

posed nature o
f the problem, they regularize
it using a penalty term
that acts on the output of a set of known filters
. In their paper, Krishnan and
Fergus use the two simplest gradients in the x

and y

direction
s
(
[

]
and
[

]
). Thei
r formulation
seeks to find a MAP (Maximum

a

Posteriori) estimate of
x
in a probabilistic framework. This is equivalent to the
following optimization problem (
i
is the pixel index, λ controls the strength of the regularization, and
⊕
is the 2D
convolution
operator):
Let us denote
(
⊕
)
.
Using the half

quadratic penalty method of Geman and Yang
[5
], they introduce
auxiliary variables
and
(denoted by
w
), thereby allowing them to move the
terms outside the
expression.
This results in a new cost function to be minimized.
β is a weight that increases
continuously throughout the optimization. As β
, the solution of the modified cost
function (eqn. (2)) converges to the original cost function (eqn. (1)). Their
algorithm uses an alternating minimization
scheme where
the non

convex part of the problem is
solved in one phase, followed by a quadratic phase which can
be
efficiently solved in the frequency domain using FFTs
. The alternating minimization proceeds by solving for the optimal
x
for a fixed
w
(called the x sub

problem), and vice versa (called the w
sub

problem).
2.1
Solving the x sub

problem
In this case, because
w
is fixed,
eq
uation
(2) is quadratic in x. Let
⊕
. Rewriting equation
(2) in matrix form
and taking derivatives,
x
can be obtained by solving the following linear system:
Assuming circular boundary conditions, they apply 2D FFTs which diagonalize the convolution matrices
and
and
x
can be solved optimally as:
where * denotes complex conjugate and
∘
denotes component

wise multiplication. The division is also performed
component

wise.
∑
(
(
⊕
)
∑

(
⊕
)

)
(1)
∑
(
(
⊕
)
(
‖
‖
‖
‖
)
∑


)
(2)
(
)
(3
)
(
(
)
∘
(
)
(
)
∘
(
)
(
)
(
)
∘
(
)
(
)
∘
(
)
(
)
∘
(
)
(
)
(
)
∘
(
)
)
(4
)
2.2
Solving the w sub

problem
The novelty in the Krishnan

Fergus paper stems from this step. Given a fixed
x
, finding the optimal
w
requires solving
2N independent 1D problems of the form:
where
. For the special case when
(
)
, the solution to the above minimization problem can
be converted into finding the real roots of a univariate polynomial of degree
(
)
. The polynomial is of the form:
We are interested in the largest real root
of
the polynomial such that
{
[
]
[
]
Empirically, it has been found that the normal range for
for natural images is between 0.5 and 0.8. This implies that
there are only 4 values of
(2, 3, 4 and 5) that satisfy
the above requirement. As described in their paper, Krishnan and
Fergus resort to a pre

computed LUT to solve this part of the sub

problem efficiently.
The range of values for
limits
the degree of the polynomial between 3 and 6.
For several values of
(from 1 to 256 in multiple
s of
2
) and
(200,000 values uniformly in the range [

10,10]), we determine the optimal
using an efficient and ro
bust root solver
described in [6
].
For the sake of completeness, Algorithm 1 below reviews the main steps inv
olved in the Krishnan

Fergus algorithm.
3.
GPU
IMPLEMENTATION
The nature of the Krishnan

Fergus deconvolution algorithm lends itself nicely to a GPU implementation. The first sub

problem in the alternating scheme is to solve for
w
, which as previously discussed, can be approximat
ed very quickly by
usi
ng a look
up table. The CUDA memory hierarchy provides efficient means for implementing such a table, including
the constant, texture, and global memory spaces. The second sub

problem, which solves for
x
, can be found directly in
the Fourier domain using
only three FFTs per ite
ration of the algorithm. Nvidia
provides an implementation of the Fast
Fourier Transform in their CUFFT library, which includes several parallel algorithms each with different performance
(
(
)
)
(5
)
(
)
(
(
)
)
(
)
(6
)
Algorithm 1:
Fast image deconvolution using hyper

Laplacian priors
Input:
Blurred image
; blur kernel
; regularization weight λ; and prior exponent
regime parameters:
,
,
; number of inner iterations
Output:
Deblurred image
Precompute constant terms in eqn. (4).
Generate a LUT by solving eqn. (6)
while
do
for
i = 1 to
do
Given
, solve eqn. (5) for all pixels using LUT to obtain
Given
, solve eqn. (4) to obtain
end for
end while
return
and accuracy characteristics. Based on thes
e observations and previous experience with CUDA, we believed that
mapping the Krishnan

Fergus algorithm onto the GPU could
lead to significantly faster
running times.
Our initial CPU

based implementation of the Krishnan

Fergus algorithm was on a system wi
th dual quad

cores and the
running time was almost two seconds for a relatively small image (710x470 pixels). As we ported the algorithm over to
the GPU using CUDA, we utilized many performance optimizations to achieve significant gains in running time.
We
have organized these optimizations into four categories and present them below in sections 3.3

3.7. In the next two
sections, we briefly discuss an important tool for identifying bottlenecks, as well as the CUFFT library.
3.1
CUDA Visual Profiler
Before
discussing our optimizations, we want to quickly emphasize the importance of profiling the CUDA code in order
to identify bottlenecks and inefficient code. Throughout our effort to port the algorithm to the GPU, we routinely made
use of the CUDA Visual P
rofiler (CVP), which provides a highly detailed report on the number of times operations
occurred, together with the amount of time they took to complete. Many other statistics are also included on a per

kernel
basis, such as the occupancy, registers used
, and number of instructions issued. In particular, the GPU Time Height Plot,
shown in figure 1, illustrates with a bar chart how long each kernel and memory copy operation took to complete,
thereby quickly highlighting which operations should be optimize
d first. In this snapshot, the second, third, and last bars
show the time it takes to copy data to and from the GPU. The remaining bars, which are all roughly the same
magnitude, highlight the running times for all of the kernels in our implementation.
After the I/O operations, the next
highest bars are all from kernels executed by the CUFFT library, and thus outside of our control. The next kernel for us
to optimize would be “compute_out” which is called once per iteration of the algorithm, and whose p
urpose is to
perform the final computations when solving for
x
. In the left column of the figure, the number of times (i.e. sessions)
that we executed the profiler and saved the output (up to this point) is shown to be 23. CVP is a great help in identify
ing
your bottlenecks and illustrating any performance improvements that result from code changes. It should be used often.
Figure
1
: T
he CUDA Visual Profile
r time height plot.
The second, third, and last
bars
plot
the time
(microseconds)
to co
py the data to and from the GPU.
All other
bars represent each
of the kernels executed
.
Figure
2
: The CUDA Visual Profiler time summary plot. After padding the images to be a power of two, the
four most time
consuming kernels are all within the CUFFT library, and occupy 47% of the total time.
3.2
CUFFT library
The CUFFT library is Nvidia
’s parallel implementation of the Fast Fourier Transform on the GPU. The library contains
several algorithms for FFT each with
different accuracy and performance capabilities, based upon the size of the input
data. Although the CUFFT library supports single and double precision, real and complex values, and forward and
backward transformations, it does have a few limitations. T
wo and three

dimensional transforms must contain fewer
than 16384
elements in each dimension, and t
he limit for one

dimensional transforms is about 8 million elements, which
equates to an image th
at is roughly 2800x2800 pixels. Bases on some initial exper
iments, 1D transforms were slightly
faster than 2D transforms, and thus were used in our implementation of the Krishnan

Fergus algorithm.
In our initial implementation, we simply called the FFT routines with data whose size was based upon the original inpu
t
image, e.g., 710 pixels wide by 470 pixels high. Because the CUFFT library is optimized for transform sizes that are
powers of small prime factors, e.g. powers of two, we then tried padding the original image (and necessarily all
associated blur kernels
and filters), to be the next largest power of two in each dimension. In the example above, the
new padded image was 1024 pixels wide by 512 pixels high, an increase of over 57% in the total number of pixels.
However, the running time of the entire Krishn
an

Fergus algorithm was then cut by almost 80%. The size of the input
data clearly
ha
s a significant impact on which internal algorithm is used in the CUFFT library. We explore this fact
further section 4
.2
.
Figure 2 is another snapshot of the CVP that
shows the time summary plot for the kernels and memory transfers to and
from the GPU. In parentheses next to the name of each kernel is the number of times it was called. Even after padding
our data to be a power of two in each dimension, the CUFFT kernel
s, whose names all begin with spRadix, are still the
four most expensive kernels in our implementation and they sum to over 47% of the current running time. There are
several alternative implementations of FFT
on the GPU [7,8,9
] which could potentially le
ad to faster running times for
our algorithm, but we have not yet experimented with them.
3.3
Memory transfers
Transferring data between the host machine and the GPU is often one of the biggest bottlenecks for CUDA
implementations, and this algorithm was no e
xception. As can be seen in figure 1, the two longest operations are when
we copy the
lookup
table from the CPU to the GPU (second bar), and again when we copy the final image back from the
GPU (last bar). What this figure does not show are the other num
erous spikes related to data movement that we were
able to reduce or eliminate by being more careful about memory transfers. These optimizations fall roughly into two
categories: minimizing the amount of data transferred, and making the transfers faster.
One of the most important decisions to make when implementing any algorithm for the GPU is whether to use single or
double precision floating point variables. It is easy to fall into the mindset of using double precision for greatest
accuracy, but one rea
lly must evaluate if this is absolutely necessary when programming for the GPU. One reason for
this is because N
vidia
GPUs double precision performance is only 1/8
th
that of single precision (until just recently when
the Fermi architecture was released, a
t which point the double precision performance increased to one half that of single
precision, as one might expect). Another reason is the increase in all GPU resources needed for double precision:
registers, on

chip memory, on

device memory, and bandwidt
h required to transfer the data to and from the GPU.
Our initial
lookup
table was over 100MB in size, and included four values of alpha and 17 values of beta. Each entry
was also stored as a floating point using double precision. Upon more careful inspe
ction of the output images when
using single versus double precision variables, we realized that the differences were imperceptible (on the order of 1 or 2
units difference out of 255 for the red, green, or blue components of the image) and therefore singl
e precision would be
sufficient for our needs. This issue needs to be evaluated for any problem domain, but for image data, a difference of
this small magnitude in the image will not be noticeable to the end

user, but the difference in running times will
be. In
addition to using floats, we also limited the values of alpha and beta to be exactly what was required: we could limit
alpha to only one value, and beta to only six values. In the end, our
lookup
table was reduced to only 4MB.
As discussed in the
CUFFT section, we decided to pad our images, blur kernels, and image filters to all be a power of
two in each dimension. Initially we did this on the CPU and then transferred the data to the GPU. However, by using
CVP, we immediately realized how much tim
e this was taking and modified our approach. In our current version of the
code, we only transfer the bare minimum data required, and then pad the data on the GPU itself using very simple
kernels. This provided a nice reduction in data movement for the i
nput image, but a more significant savings occurred
for the blur kernel and image filters, which are much smaller, e.g. the Gaussian blur kernel is only 13x13 pixels. The
reduction in times to copy and pad the image, blur kernels, and image filters ranged
from 50% to as much as 95%.
In addition to limiting the amount of data to transfer to and from the GPU, there are also several standard optimizations
that help make the transfers faster. One of the optimizations we used is page

locked memory on the host,
which
prevents the operating system from paging the memory out, and therefore the GPU’s access to that data over PCI

Express can be at full

speed. Another well

known optimization that we used is that of asynchronous memory copies.
Rather than waiting fo
r the
lookup
table to finish being copi
ed to the GPU, we can use Nvidia
’s asynchronous API
which allows kernels to be executing in parallel with the memory copy. Thus, we can be running our simple kernels to
pad the image data to hide the latency of the m
emory copy.
3.4
Memory access
Although each GPU multiprocessor has shared memory and 32

bit registers on the chip itself, many data accesses that
the kernels make are to the global memory on the graphics card. Global memory is not cached (on all but 2.x compu
te
capability devices) and these accesses have a much longer latency than those to on

chip memory and therefore great care
must be taken
to fully optimize them. Nvidia
’s CUDA C Best Practices Guide goes into depth on how to make global
accesses coalesced,
or aligned, and claims this may be the single most important performance consideration when
programm
ing for the CUDA architecture [10
]. We agree with this assertion and note that this optimization has become
commonplace when writing CUDA code. To verify
that each of our kernels
was
making aligned accesses, we again
used CVP to see how many 32

, 64

, and 128

byte global loads and stores were made. By analyzing these numbers, we
could optimize our code and reduce the number of 32

and 64

byte accesses, th
ereby increasing our overall performance.
As global memory is not cached, it is important to know about the texture and constant memory spaces available in the
CUDA architecture as well. These memories are cached and are ideally suited for memory that is
going to be loaded
once and then accessed in a read

only fashion. Our
lookup
table falls perfectly into this scenario. Our original
implementation used global memory to store the
lookup
table, but by accessing the data using a 1D texture, the
“solve_imag
e” kernel’s running time was cut by 50%. As this kernel is called twice per iteration of the algorithm (12
times in all), the improvement in this one kernel reduced the total running time by 5%.
3.5
Parallel execution
Running algorithms in parallel is the cor
e of GPU computing. Serial operations can probably best be executed on the
CPU, while parallel portions can usually be efficiently mapped to the GPU, provided some care is taken to maximize the
parallelism given its fixed resources. We have already discu
ssed one standard optimization technique regarding
parallelism and that is overlapping memory transfers between the host machine and the GPU device with other work.
These asynchronous copies allow kernels to be executing on data already loaded onto the de
vice, while new data is
copied over to it, effectively hiding the latency involved in the data copy.
For GPU devices with CUDA compute capability 1.x, only one kernel can be executed at a time, although the kernels are
run in multiple thread blocks. To ke
ep the multiprocessors busy on the GPU, there should be at least as many thread
blocks as there are multiprocessors, but ideally there will be more in order to hide the latency that exists when a warp (a
grouping of 32 threads) is not ready to execute it’s
next instruction, e.g. if it must wait for data to be loaded from global
memory or some other operations to complete. When one warp stalls, another warp can be quickly scheduled to execute
and effectively hide the latency of the memory access.
In our imp
lementation, we have maximized the number of warps and thread blocks that are running on the GPU by
carefully managing the limited GPU resources, including registers and shared memory that are common to each thread
block. By carefully rewriting our GPU ke
rnels, and in some cases subdividing them into smaller kernels, we were able
to reduce the number of registers and shared memory needed, thereby increasing the number of blocks that could be
executed in parallel. One immediate benefit of using floats inst
ead of doubles is that the number of registers used was
cut in half. Another means of reducing the number of registers was by eliminating temporary variables that were
convenient when writing the code, but not strictly necessary for correct execution. Fo
r example, by reordering some of
the computations in one kernel, we could avoid several temporary variables that increased our register use.
To determine how many registers and how much shared memory were used by each kernel, we ran our code through the
CV
P, which automatically reports these statistics. We could then plug this information into the CUDA Occupancy
Calculator to determine where the bottlenecks were and how we could best increase the amount of parallelism in the
kernels. The Occupancy Calcula
tor is really just a simple spreadsheet that computes this information for you once you
have entered only four pieces of information: CUDA compute capability of the GPU, number of threads per block,
registers needed per thread, and shared memory needed per
thread. Only the first item is completely determined by the
graphics hardware you have, while the three remaining items can be influenced by the user and the implementation.
Occupancy is loosely defined as the ratio of the number of warps running concur
rently on a multiprocessor divided by
the maximum number of warps that could be running concurrently based on the hardware limits of the GPU. Having an
occupancy of 1 (or 100%) is the best case, but does not necessarily translate to improved performance o
ver lower ratios.
However, a kernel with a low occupancy may prevent the multiprocessor from hiding latency adequately for memory

intensive kernels.
3.6
Instruction throughput
Other optimizations to CUDA code involve trying to minimize the number of cycles
required to get the same work done.
In other words, by reducing the number of cycles needed, the GPU can process more data in the same amount of time,
thereby increasing the amount of data processed per instruction, i.e. the instruction throughput. We ha
ve utilized these
optimizations in several forms, such as minimizing slow arithmetic instructions, minimizing divergent warps, and
avoiding unnecessary type conversions.
Arithmetic instructions can vary significantly in the number of operations per clock c
ycle on the GPU. For CUDA
compute capability 1.x, eight 32

bit floating

point adds can occur in each clock cycle, but only
one 64

bit floating

point
add [11
]. 32

bit integer multiplications on the other hand, require several instructions by themselves, s
o are very
inefficient by comparison. However, 24

bit integer multiplications (e.g. __[u]mul24) can again be done eight times per
clock cycle, and therefore should be utilized when appropriate. For example, when computing the index of a specific
thread i
n a kernel, one could replace the standard
int index = blockDim.x * blockIdx.x + threadIdx.x
, with the
following:
int index = _umul24(blockDim.x, blockIdx.x) + threadIdx.x
. Another significant hit is using division or
modulo inside the kernel. For many o
f our kernels, since they involve two

dimensional images implemented as one

dimensional arrays, we are often dividing the index by the number of columns to determine which row of the image we
are going to operate on. Rather than doing this repeatedly insid
e the kernels, we could instead multiply by the reciprocal
of the number of columns, which is fixed for a given image and can be computed once and stored in constant memory.
Divergence refers to the situation when threads in a single warp take different co
de paths due to a branch condition, e.g.
if
and
switch
statements. As a result, the different execution paths must be serialized, thereby reducing the amount of
parallelism achieved. For all of our kernels, we made great effort to eliminate divergence wi
thin warps. For example,
by carefully choosing the number of threads per thread block, and by padding our images to be of sizes that are a power
of two, we could guarantee that every thread in every thread block was assigned the work associated with one o
f the
pixels in the image. Thus, we could immediately remove all of the if

statements that check boundary conditions (e.g.
if
(index < num_pixels
)), and cause divergence within warps.
We have already spoken at some length on using floats instead of double
s, however care must be taken to avoid all
implicit conversions to doubles. By using the

keep
compiler option and examining the temporary files, we were able
to find several instances of literal constants that were being converted to doubles automatical
ly. To avoid this, the
simple scheme of adding an “f” after the literal prevents this from happening, e.g. 1.0f. Another good lesson was to
remove the
sm_13
compiler directive, which prevents the compiler from using doubles at all, and thus warning messa
ges
were given at compile time that doubles were being demoted to floats. Such messages allow you to easily find and fix
those problems. Another optimization is to use the single

precision functions when appropriate. For example, using
rintf to round a
floating point variable to an integer requires only one instruction, whereas roundf() requires eight
instructions. There are many other such examples of this as well, including truncf(), ceilf(), and floorf(). Charts
detailing these instruction counts ca
n be found in the NVI
IDA CUDA C Programmer’s Guide [11
].
4.
EXPERIMENTAL RESULTS
Our experiments were conducted on a dual quad

core (Intel Xeon E5506 @ 2.13GHz) 64

bit Linux workstation, wit
h
12GB of RAM and an Nvidia
GeForce GTX 260 graphics adapter. The GT
X 260 has 896MB of global memory, 27
multiprocessors (216 cores total), and CUDA compute capability 1.3. The CUDA driver at the time was version 3.0.
4.1
Running Times
Figures
8
and
9
are examples of the results of our implementation of the Krishnan

Fergus algorithm for the GPU. The 3

channel color images in figure 8
are of size 710 x 470 pixels, and were processed in 0.066 seconds (i.e. the equivalent of
over 15 frames per second (fp
s)). The 1

channel version of this image required 0.024 seconds (about 41 fps), roughly
one

third of the time for the RGB image. The time is not exactly one

third due to the fixed start

up costs associated with
the implementation, such as copying the
loo
kup
table, kernels, and filters to the GPU, which only happen once
regardless of the number of channels (or images) being deblurred. These times indicate that our algorithm was
processing 4.8MPel/sec for 3

channel images, and 13Mpel/sec for 1

channel imag
es. Similarly, the one

and three

channel deblurring
times for the images in figure 9
are respectively 0.041 seconds and 0.116 seconds (or the equivalent
of 24.4 fps and 8.6 fps). These images are slight
ly larger than those in figure 8
at 768 x 512 pixel
s, and achieved a
slightly lower throughput of 3.2Mpel/sec for the 3

channel image, and 9.1 Mpel/sec for the 1

channel image. This
reduction is due to the larger padded size of this image (1024x1024) compared to the one in figure
8
(1024x512).
Although t
he number of rows in the image in figure
9
is already a power of two (i.e. 512), we have to pad all images by
half the kernel size along each border, followed by the additional padding up to the next larger power of two. Thus we
pad this image by 6 pixels
around the border (increasing the width and height of each image by 12 pixels), which then
forces us to pad up to 1024 in each dimension to achieve our desired size for the CUFFT library.
Figure
3
: Running time of our GPU implem
entation of the Krishnan

Fergus algorithm.
We plot the number of megapixels in each image versus the running time.
F
igure 3 shows the running time of our GPU implementation on over 40
(3

channel)
images of various sizes. Upon
initial inspection, the runn
ing times appear to increase linearly with respect to the number of megapixels in the image,
and in fact, the equation of the linear regression line was calculated to be f(x) = 0.314x + 0.0019, with r
2
= 0.92. This
implies that our implementation can proc
ess over 3 Mpel per second
for 3

channel images
. On closer inspection
however, we noticed a staircase pattern in which distinct jumps in running times were clearly evident as the images sizes
increased. This effect is an artifact of our padding the size
of the images up to the next highest power of two for
performance considerations related to the CUFFT library. We explore this behavior further in the following section.
4.2
CUFFT Performance
As noted in section 3.2, by padding our image sizes to be the next
largest power of two, the overall running time of the
algorithm was significantly reduced. This improvement was a direct consequence of the reduction in time taken within
the CUFFT library.
Documentation on the library [12
] indicates that performance i
s best when the size of the
transformation is a power of small prime factors, e.g. 2, 3, 5, 7. While our initial optimization based on powers of two
produced significant improvements, they also seemed excessive for images that were already of sizes that w
ere powers
of two because we would have to quadruple the number of pixels processed (essentially doubling the size in each
dimension) to get the performance gain. We asked ourselves if we could not find smaller sizes that were also easily
divided into sma
ll prime factors and that would also offer similar performance gains.
We conducted an experiment in which we started with an image that was 512 x 512 pixels in size and then padded it by
12 pixels in each dimension (as required by our implementation for pr
ocessing the image) to produce an image that was
524 x 524. Next, we padded the image to be
n x n
pixels, where
n
ranged from 525 all the way up to 2750 pixels. For
each padded image, we ran our algorithm to determine the overall running time. The goal
was to determine which sizes
worked best with the CUFFT library and thus produced the best overall running times. Figure 4
(left)
shows the plot of
the padded image size versus running time for all values. Although the figure is quite cluttered, the over
all t
rends are
illustrated. Figure 4 (right)
shows only the first 100 entries of that plot to highlight more clearly the behavior as we
increase the dimensions of the image. The running times appear to jump all over the place, but there is a clear patter
n:
the longest times occur exactly when the padded size equals a single prime number, such as 541, 547, 557, 563, etc. If
you ignore those times, you will see another set of values that are higher than the rest and also appear on a monotonically
increasin
g curve. These times refer to padded sizes that are equal to a small prime factor, e.g. 2, 3, 5, multiplied by a
large prime number. For example, the first four such entries are 526 (2 * 263), 537 (3 * 179), 538 (2 * 269), and 542 (2
* 271) . The most i
mportant information in these plots, however, are the
fastest
running times so that we can potentially
optimize our padded sizes for all images and thereby produce the fastest results.
Figure
4
: Line plots showing padded image
size versus running time. The plot on the right shows
a zoomed view of the first 100 entries in the left plot.
Prime n
umbers are clearly inefficient.
To find these values, we sorted the results (pairs of sizes and running times) according to running tim
es and then only
kept those sizes that were monotonically increasing. For example, the six fastest running times for
the sizes reflected in
figure 4
were 540, 576, 600, 528, 546, and 625. If we only keep the monotonically increasing values, we get 540, 5
76,
600, and 625. The reason for this is that it makes no sense to ever pad an image up to size 528 x 528 because we found
that padding to 540 x 540 produces even faster running times. For this example image, the running times for these four
optimized si
zes are 0.0772, 0.0843, 0.0849, and 0.0963 seconds. Again, this is attributed to the fact that these numbers
are nicely factored into small prime numbers, which allows the CUFFT library to use the most efficient kernels. For this
experiment, after sortin
g and keeping the monotonically increasing sizes, we obtained a
lookup
table that contains 24
entries: 540, 576, 600, 625, 648, 720, 729, 1024, 1080, 1125, 1296, 1323, 1536, 2048, 2058, 2160, 2304, 2400, 2500,
2592, 2625, 2646, 2700, and 2744. As you can
see, there are several sizes that are faster than just padding to 1024, as
our original algorithm would do. Our new algorithm for deciding what sizes to pad images up to are based on this look
up table: if the original image size is less than or equal to
540, pad to 540, otherwise consider the remaining entries in
increasing order until the size is less than the value in the
lookup
table. For images that are even larger than 2750, or
smaller than 512, we simply used the next larger power of two. However,
2750 appears to be the largest image that we
can process with our algorithm on the GXT 260 without running out of graphics memory.
To test our new algorithm, we conducted another experiment in which we resized a single image (originally 1000 x 1000
pixels
) to be of size
n x n
pixels, where
n
ranges from 1 to 2000. For each resized image, we ran our software and
recorded the overall running time for our origi
nal algorithm, shown in figure 5
, as well as our
new algorithm, shown in
figure 6
. The staircase
effect that we described earlier is clearl
y evident in figure 5
whenever we hit a power of two.
The inset image highlights the behavior for the smallest of values, which are not discernible in the larger plot. The
results from our new algorithm in figure
6
also exhibit the staircase behavior, but at much finer levels, which allows our
running times to be significantly reduced for many image sizes. The inset image again shows a zoomed view of the
smallest image sizes, although we have truncated the data t
o start at 500 pixels, since the algorithms are the same up
until this point. One thing to note is that these results are for square images. When the width and height of the images
are not equal, additional staircase effects will occur. Another interest
ing point is that our new algorithm supports image
sizes up to 7 Mpel, whereas the original one caused us to run out of graphics memory around 4 Mpel because the padded
sizes simply became too large.
Figure
5
: Padded image size plotted against running
time for our original algorithm of padding up
to the next largest power of two. The inset image shows a zoomed view of the smallest sizes.
Figure
6
: Padded image size plotted against running time for both our original and optimized padded
algorithms.
The inset image shows a zoomed view of the smallest sizes, starting at 500 x 500 pixel images.
As a final test, we applied our new padding algorithm to the same input images from figure 3. The new running times,
(solid circles)
along with the old value
s
(crosses)
, are shown in figure
7
. In all but a single case, the optimized algorithm
produces times that are as good as, or better than, the original algorithm. In some cases, when the algorithms pad to the
same sizes, there are normal fluctuations in r
unning times that have one or the other slightly faster. We discount these
discrepancies and consider the results equivalent. In other cases, the new algorithm results in a reduction in time of
between 10% and 40%. The one significant exception is for a
n image that is of size 1600 x 1200 pixels. Based on the
original algorithm, the padded size is 2048 x 2048, or 2
22
, but the new algorithm pads it to 2048 x 1296, which can be
written 2
15
* 3
4
. In this example, the new running time is almost 15% slower t
han the original. The newest
documentation on the CUFFT library does provides some new guidelines as to what sizes will be the best for their
implementation. In fact, they comment specifically on this type of example where the two sizes are close, but th
e larger
number is a multiple of a single factor, but the smaller number is the multiple of two factors. Further ex
perimentation
could shed more
light on these optimal sizes, but it would require an exhaustive
test of all possible image dimensions
.
5.
CONCLUSIONS
We have presented our implementation of the Krishnan

Fergus deconvolution algorithm
using Nvidia
’s CUDA
architecture. The algorithm requires efficient solutions to two sub

problems, each of which map well onto the GPU.
One sub

problem can be
approximately solved using a lookup

table, which maps perfectly to the cached texture
memory. The other sub

problem can be solved using FFTs, which can be efficiently computed in parallel thanks to the
CUFFT library provided
by Nvidia
. Programming for th
e GPU is a non

trivial task so we have carefully described the
numerous optimizations employed to achieve these interactive running times. We have also made several observations
regarding performance of the CUFFT library, and how through carefully selecti
ng the transform sizes, one can see
significant improvements. For 3

channel color images, our GPU implementation can process over 3 megapixels per
second (often more), which is over 27 times faster than the original CPU implementation by Krishnan and Ferg
us.
However, our GPU implementation is not without limitation. Currently we can process images up to around 2700 x 2700
pixels, after which we run out of graphics memory on the GTX 260 (which has 896MB). For newer graphics cards, with
additional memory,
we could push this limit higher, but we could also achieve this by being more judicious in our
memory usage on the graphics card.
Figure
7
: Running time of our GPU implementation of the Krishnan

Fergus algorithm using our
new padding algorithm. In all b
ut two cases, the new algorithm results in faster running times.
REFERENCES
[1]
Krishnan, D. and Fergus, R., “Fast Image Deconvolution using Hyper

Laplacian Priors,” NIPS, 1033

1041 (2009).
[2]
Domanski, L., Vallotton, P. and Wang, D., “Two and Three

Dimensional I
mage Deconvolution on Graphics
Hardware”, MODSIM09 (2009).
[3]
Richardson, W. H., “
Bayesian

Based Iterative Method of Image Restoration
”,
Journal of the Optical Society of
America
(JOSA), 62
(1):55
–
59
(1972).
[4]
Lucy, L. B.
, “
An iterative technique for the recti
fication of observed distributions
”,
Astronomical Journal
, 79 (6):
745
–
754 (1974).
[5]
Geman, D. and Yang, C., “
Nonlinear image recovery with half

quadratic regularization
”, IEEE PAMI, 4:932
–
946
(1995).
[6]
Krishnan, S., Foskey, M., Culver, T., Keyser, J. and
Manocha, D., “
PRECISE: Efficient Multiprecision Evaluation
of Algebraic Roots and Predicates for Reliable Geometric Computations
”,
Proceedings of Seventeenth Annual
ACM Symposium on Computational Geometry
, 274

283 (2001).
[7]
Govindaraju, N., Lloyd, B, Dotsenk
o Y., Smith, B., and Manferdelli, J., “High Performance Discrete Fourier
Transforms on Graphics Processors”, ACM/IEEE Supercomputing (2008).
[8]
Nukada, A, and Matsuoka, S., “Auto

Tuning 3

D FFT Library for CUDA GPUs”, ACM/IEEE Supercomputing
(2009).
[9]
Gu, L, L
i, X., and Siegel, J., “An Emperically Tuned 2D and 3D FFT Library on CUDA GPU”, 24
th
International
Conference on Supercomputing (2010).
[10]
NVIDIA CUDA
TM
: CUDA C Best Practices Guide, version 3.2,
Aug 2010.
[11]
NVIDIA CUDA
TM
: CUDA C Programming Guide,
version
3.2, Oct 2010.
[12]
NVIDIA CUDA
TM
: CUDA CUFFT Library, v
ersion PG

05327

032_V02, Aug 2010.
Figure
8
: Original image (top) and deblurred image (bottom
).
The original image size is preserved
to prevent artifacts due to resizing.
Figure
9
: Original image
(top) and deblurred image (bottom
).
The original image size is preserved
to prevent artifacts due to resizing.
Comments 0
Log in to post a comment