Masaryk University
Faculty of Informatics
GPU Acceleration of
Image Processing Algorithms
Dissertation Thesis Topic
Pavel Karas
Centre for Biomedical Image Analysis
Supervisor: Doc. RNDr. Michal Kozubek, CSc.
Supervisor specialist: Mgr. David Svoboda, PhD.
Brno, September 3, 2010
Supervisor's signature:
Acknowledgement

I would like to thank both of my supervisors, Doc. Kozubek and Dr. Svoboda, for their patience and their suggestions. I thank my colleagues for their ideas and inspiration. My thanks belong also to my friends and parents for their endless support.
Contents
1 Introduction
2 Report on Current Results
2.1 Parallel Computing Architectures and Models
2.1.1 NVidia CUDA
2.1.2 ATI Stream
2.1.3 OpenCL
2.1.4 Other architectures
2.2 General-Purpose Computation on GPU
2.3 Image Processing Acceleration
2.4 Image Formation
2.5 Image Enhancement
2.5.1 Tone Mapping
2.5.2 Image Restoration
2.5.3 Geometrical Methods
2.6 Image Analysis
2.6.1 Morphological Processing
2.6.2 Edge Detection
2.6.3 Image Segmentation
2.6.4 Feature Extraction
2.6.5 Shape Analysis
2.6.6 Image Registration
2.7 Decomposition of Large Images in Image Processing
3 Dissertation Thesis Intent
References
4 Summary of Study Results
1 Introduction
Image processing is a rapidly developing field based on both mathematics and computer science. It is naturally connected to and strongly influenced by image acquisition. The community of users is wide and includes not only common customers with digital cameras, but also astronomers analysing data from telescopes and spacecraft [GW02] [SM06], specialists dealing with magnetic resonance imaging (MRI) or computed tomography (CT) [Cas96] [GW02] [Jäh05], and biologists working with optical microscopes [Ver98] [Jäh05]. The users' demands for image quality, in terms of resolution, signal-to-noise ratio, or dynamic range, are constantly growing. As a result, both image acquisition devices and image processing algorithms are developing in order to satisfy the users' needs.

On the other hand, complex image processing methods can be very time-consuming, and their computation can be challenging even for recent computer hardware. Therefore, an integral part of the field is optimization, so that the complex algorithms can be performed in a reasonable time. For example, if a patient undergoes a magnetic resonance or ultrasound investigation, it is essential to have the data processed in several minutes, so that a specialist can discuss the results with the patient and propose next steps within a single visit [GLGN+08]. In other applications, like user-assisted image segmentation, the data need to be processed in real time [HLSB04].
The most important computational hardware for image processing is the Central Processing Unit (CPU), for its versatility and tremendous speed. According to the oft-quoted Moore's Law, the CPU's computing capability doubles every two years. However, this trend is now inhibited, as the semiconductor industry has reached physical limits in production. Soon, it will no longer be possible to decrease the size of transistors, and therefore to increase the clock frequency. Thus, manufacturers seek different ways to enhance CPU speed. The most significant trend is to increase the CPU's ability to process more and more tasks in parallel, by extending the instruction pipeline, involving vector processing units, and increasing the number of CPU cores.
In contrast to CPUs, Graphics Processing Units (GPU) have recently been considered very efficient parallel processing units. While their original purpose was to provide graphics output to users, their computing capabilities have been getting more and more complex, and their performance has overtaken common CPUs in terms of both computing speed (in floating point operations per second) and memory bandwidth [GLG05] [CBM+08] [CDMS+08] [PSL+10] [LKC+10] (Figure 1). Besides, their performance growth is not limited by current manufacturing technology, so the gulf between CPU and GPU computational power is still getting wider. The GPU architectures benefit from massive fine-grained parallelization, as they are able to execute as many as thousands of threads concurrently. Attempts to utilize GPUs not only for manipulating computer graphics, but also for different purposes, lead to a new research field called "general-purpose computation on GPU (GPGPU)". Graphics cards provide huge computational performance for a good price. For instance, the NVidia GeForce GTX 480, a top desktop graphics card with 480 streaming processors built on the new FERMI architecture, is slightly cheaper than the quad-core CPU Intel Core i7-960 (as of July 2010). Furthermore, recent computer mainboards support up to 4 PCIe x16 slots
offering space for 4 GPU cards. With multiple slots occupied, a common affordable single PC can literally be turned into a supercomputer.

Figure 1: A performance gap between GPUs and CPUs [KmH10].
This approach also shows some caveats. First, there used to be no high-level programming languages specifically designed for general computation purposes. The programmers instead had to use shading languages such as Cg, High Level Shading Language (HLSL), or OpenGL Shading Language (GLSL) [OLG+05] [CDMS+08] [RRB+08], to utilize texture units. This no longer poses an obstacle. In 2004, the Brook [BFH+04] and Sh [MDTP+04] programming languages were released as extensions of ANSI C with concepts from stream programming. In 2007, NVidia released the first version of the Compute Unified Device Architecture (CUDA) [CUD10e] [CUD10c], a programming model to code algorithms for execution on the GPU. A more general, non-proprietary framework, developed by the Khronos Group, is the Open Computing Language (OpenCL), which can be used for heterogeneous platforms, including both the CPU and the GPU [Ope10c] [Ope10b].
Second, not all methods are suitable for massively parallel processing. The GPU performance is limited by several conditions: a) the ratio of the parallel to the sequential fraction of the algorithm, b) the ratio of floating point operations to global memory accesses, c) branching diversity, d) global synchronization requirements, and e) data transfer overhead [CDMS+08] [PSL+10]. Image processing algorithms in general are good candidates for GPU implementation, since parallelization is naturally provided by per-pixel (or per-voxel) operations. Therefore, they naturally fulfill condition a), in particular. This was confirmed by many studies examining GPU acceleration of image processing algorithms. Nevertheless, programmers have to keep in mind
also the other conditions in order to keep their implementations efficient. An enormous effort has recently been spent on accelerating image processing on GPU. Many examples of successful implementations are cited in the following section of the text.

Figure 2: GPU performance in single and double precision [Gri10].
Third, programmers should be aware that graphics cards perform best in single floating point precision. The performance in double and higher precisions is significantly worse. For instance, AMD/ATI GPUs provide approximately 5 times lower computational power in double precision compared to single precision [AMD10]; NVidia GPUs provide even 16 times lower power [CUD10e] [Tes09]. The latter should change with the release of TESLA cards built on the recent FERMI architecture [Tes10] (Figure 2). In many applications this does not pose an obstacle, since single precision can be enough, as will be shown in the following section. In some cases, a computation can be divided into two parts, of which the critical one (in terms of precision) is executed on the CPU, whereas the other one is executed on the GPU. In the other cases, careful attention needs to be paid to precision, and the users should probably wait for better GPU architectures.
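The split between a precision-critical part and the rest can be motivated by a small experiment. The following sketch (our illustration, not taken from the cited reports) accumulates many equal terms in single and in double precision; the float sum saturates once rounding swallows each new term:

```cpp
#include <cassert>

// Accumulating 1.0f repeatedly in single precision stalls at 2^24 =
// 16777216, because 16777217 is not representable as a float and each
// further addition rounds back down. A double accumulator has no such
// problem at this magnitude.
float sumFloat(int n, float term) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += term;
    return acc;
}

double sumDouble(int n, double term) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i) acc += term;
    return acc;
}
```

Long reductions of this kind (e.g. sums over all pixels of a large image) are natural candidates for the precision-critical, CPU-side part of such a split computation.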
Fourth, one of the bottlenecks of graphics cards is their relatively expensive, and therefore small, global memory. A common modern GPU provides approximately 4–8 times less global memory than a common CPU. This can pose significant problems for programmers in their effort to implement algorithms on GPU. The amount of processed data can be enormous, especially in applications which produce three- or more-dimensional images, such as optical microscopy or computed tomography. Another example is magnetic resonance imaging using diffusion tensor images. In this case, the image domain is relatively small, but the amount of information stored for every single image element is huge. As a result, the size of the image can exceed the size of GPU global memory, or, at best, there is enough global memory to allocate the space for the image itself, but not for auxiliary data structures.
The simple fact that the image size exceeds the amount of available memory does not itself pose an obstacle for a potential GPU implementation. There is a large class of image processing algorithms which can be referred to as methods with local knowledge. In this case, in order to process each image element, only local information about the element and its closest neighbours is needed. Image tonality adjustment, edge detection filters, and linear combinations of multiple images are typical examples of this class. Linear and nonlinear filters with relatively small kernels, as well as dilation and erosion with small structural elements, can also be considered members of this class. Due to the nature of local algorithms, it is easy to split the data into small chunks. Consequently, these parts can be processed independently, so only little effort actually has to be made on problem decomposition. As will be shown in our survey, this is the case for most current GPU implementations dealing with huge images.
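The chunking of local methods can be sketched as follows. This is a minimal CPU-side illustration (our own; the tile size, names, and the 3×3 box filter are arbitrary choices): each tile is processed independently, reading at most one pixel beyond its border, so the tiles could equally well be dispatched to a GPU one by one.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A rectangular region of the image; a filter of radius r only needs the
// tile's pixels plus an r-pixel halo of neighbours around it.
struct Tile {
    int x0, y0, x1, y1;  // region in image coordinates (x1, y1 exclusive)
};

std::vector<Tile> makeTiles(int width, int height, int tileSize) {
    std::vector<Tile> tiles;
    for (int y = 0; y < height; y += tileSize)
        for (int x = 0; x < width; x += tileSize)
            tiles.push_back({x, y, std::min(x + tileSize, width),
                             std::min(y + tileSize, height)});
    return tiles;
}

// Apply a 3x3 box filter to one tile. Reads may reach one pixel into
// neighbouring tiles (the halo), but writes stay inside the tile, so the
// tiles are independent of each other.
void boxFilterTile(const std::vector<float>& in, std::vector<float>& out,
                   int width, int height, const Tile& t) {
    for (int y = t.y0; y < t.y1; ++y)
        for (int x = t.x0; x < t.x1; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int xx = std::clamp(x + dx, 0, width - 1);
                    int yy = std::clamp(y + dy, 0, height - 1);
                    sum += in[yy * width + xx];
                }
            out[y * width + x] = sum / 9.0f;
        }
}
```

Processing the tiles one after another produces exactly the same output as filtering the whole image at once, which is what makes this decomposition trivial for local methods.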
On the contrary, algorithms which we refer to as methods with global knowledge need to store the entire image. The Fourier transform (and the fast Fourier transform as well), the Hough transform, or the wavelet transform are good examples of this class. Many methods, for instance convolution and deconvolution with large kernels, are derived from these transforms, and therefore should also be considered global. The Euclidean distance transform, the watershed algorithm, and generally all segmentation methods based on a functional minimization are other candidates. In this case, it is necessary to seek nontrivial methods to decompose a problem into several parts which are independent from each other, in order to process the data on architectures with a small memory.
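As a toy example of such a decomposition, consider convolution with a kernel of length K: although the naive view requires the whole signal, each block of output samples in fact depends only on its own input samples plus the K−1 preceding ones, so the computation can be carried out block by block in a small memory. The sketch below (our 1D illustration, not a method from the surveyed papers) demonstrates the equivalence:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Causal convolution with zero padding:
// out[i] = sum_{j=0}^{K-1} h[j] * in[i-j], with in[n] = 0 for n < 0.
std::vector<float> convolve(const std::vector<float>& in,
                            const std::vector<float>& h) {
    std::vector<float> out(in.size(), 0.0f);
    for (size_t i = 0; i < in.size(); ++i)
        for (size_t j = 0; j < h.size(); ++j)
            if (i >= j) out[i] += h[j] * in[i - j];
    return out;
}

// The same result computed block by block: the outputs of each block only
// read their own input samples plus the K-1 preceding ones (the overlap),
// so the blocks are independent and fit into a small memory one at a time.
std::vector<float> convolveBlocked(const std::vector<float>& in,
                                   const std::vector<float>& h,
                                   size_t blockSize) {
    const size_t K = h.size();
    std::vector<float> out(in.size(), 0.0f);
    for (size_t b0 = 0; b0 < in.size(); b0 += blockSize) {
        const size_t b1 = std::min(b0 + blockSize, in.size());
        for (size_t i = b0; i < b1; ++i)    // outputs of this block
            for (size_t j = 0; j < K; ++j)  // reads reach back K-1 samples
                if (i >= j) out[i] += h[j] * in[i - j];
    }
    return out;
}
```

For FFT-based convolution, the same idea appears in the classical overlap-add and overlap-save schemes; for genuinely global methods such as the watershed, no such simple overlap exists, which is precisely what makes their decomposition a research problem.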
In this decomposition process, it is important to pay attention to task dependency. The most desired goal is to split the data into independent parts, so that they can be processed in parallel, in our case on multiple GPU cores. Also, it may become possible to deal with huge data that fit neither in GPU memory nor in CPU memory. Without decomposition, such a computation would have to be performed in a larger, but much slower, external memory.
Intention of Thesis. The main goal of this dissertation thesis should be the acceleration of image processing algorithms with respect to huge data. Since our laboratory, the Centre for Biomedical Image Analysis (CBIA), is focused on optical microscopy, we deal with large 3D images; therefore, it is necessary to have tools for processing such data. The thesis should focus on the decomposition of image processing algorithms, especially of the class of global methods. For implementation we choose the GPU platform, since it is suitable not only for testing, but also for considerable speedup of computationally intensive algorithms. Although many papers concerning GPU acceleration have been published, very few of them dealt with huge images and problem decomposition; therefore, a lot of questions remain unanswered in this field.
2 Report on Current Results
In this section we give a summary of the state of the art in GPU acceleration of image processing algorithms. The section is organized as follows: First, several popular parallel computing models are briefly described. Second, GPGPU is introduced and the most important surveys and performance studies are summarized. Third, a survey of GPU implementations is made, with respect to image processing algorithms. In the final subsection, we focus on implementations dealing with large images.
2.1 Parallel Computing Architectures and Models
2.1.1 NVidia CUDA
As the number of published papers shows, one of the most widely used parallel programming models is CUDA, developed by NVidia. It is very easy for programmers to use, since it introduces a small number of extensions to the C language in order to provide parallel execution. Other important features are the flexibility of data structures, explicit access to the different physical memory levels of the GPU, and a good framework for programmers, including a compiler, the CUDA Software Development Kit (CUDA SDK), a debugger, a profiler, and the CUFFT and CUBLAS scientific libraries [CUD10e] [CUD10d].
From the hardware point of view, the NVidia graphics card architecture consists of a number of so-called streaming multiprocessors (SM). Each one includes 8 shader processor (SP) cores, a local memory shared by all SPs, 16384 registers, and fast ALU units for hardware acceleration of transcendental functions. A global memory is shared by all SMs and provides a capacity of up to 4 GB and a memory bandwidth of up to 144 GB/s (as of July 2010). The FERMI architecture introduces new SMs equipped with 32 SPs and 32768 registers, improved ALU units for fast double precision floating point performance, and an L1 cache.
From the programmer's point of view, every program consists of two parts: a host code run on the CPU (called the "host"), and a kernel, a piece of code run in parallel on the GPU (called the "device"). CUDA's extensions to the C programming language allow the programmer to define how many threads performing a kernel will be executed on the device. The number of threads can greatly (up to 10,000) exceed the number of processing units which execute the program. This paradigm allows programmers to consider the GPU an abstract device providing a huge number of virtual resources. Therefore, not only does it provide good scalability for future hardware, it also allows hiding the global memory latency by very fast thread switching [CUD10c].
The threads executed on the GPU form a 1D, 2D, or 3D structure called a block (Figure 3). The blocks are arranged in a 1D or 2D layout called a grid. Thus, each thread can be uniquely identified by its coordinates within a block, and by the coordinates of its block within the grid. Each block of threads is executed on a single SM; therefore, all threads of a single block can share data in a shared memory. A block's threads are executed in warps of 32 threads each.
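This indexing scheme can be mimicked on the CPU. The following sketch (ours; a real CUDA kernel would express the four loops implicitly through the kernel launch) derives each thread's unique global coordinates from its block index, block dimension, and thread index, and uses them to add two N×N matrices:

```cpp
#include <cassert>
#include <vector>

struct Dim2 { int x, y; };

// CPU emulation of a CUDA-style launch: every (block, thread) pair derives
// its unique global coordinates from blockIdx, blockDim, and threadIdx and
// processes one matrix element, as the threads of a kernel would.
void matAddLaunch(Dim2 gridDim, Dim2 blockDim,
                  const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int N) {
    for (int blockIdxY = 0; blockIdxY < gridDim.y; ++blockIdxY)
      for (int blockIdxX = 0; blockIdxX < gridDim.x; ++blockIdxX)
        for (int threadIdxY = 0; threadIdxY < blockDim.y; ++threadIdxY)
          for (int threadIdxX = 0; threadIdxX < blockDim.x; ++threadIdxX) {
              int i = blockIdxX * blockDim.x + threadIdxX;  // global column
              int j = blockIdxY * blockDim.y + threadIdxY;  // global row
              if (i < N && j < N)  // guard threads of partial edge blocks
                  C[j * N + i] = A[j * N + i] + B[j * N + i];
          }
}
```

The guard `if (i < N && j < N)` mirrors the bounds check a real kernel needs whenever the grid does not divide the image size evenly.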
Figure 3: Thread hierarchy in the CUDA programming model [CUD10c].

Figure 4: Memory hierarchy in the CUDA programming model [CUD10c].

A memory hierarchy is introduced in a number of levels (Figure 4). In the lowest level, there are small, fast, non-shared registers. Then, each SM contains a shared memory, as mentioned above. This memory is also fast and small; it can provide communication between threads within a block. Another important memory space is the constant memory, which is read-only. Since it is cached, it usually serves as a fast storage place for function parameters and constants. The texture memory is read-only and cached as well, but it also includes several useful in-hardware features, such as a number of addressing modes, interpolation, and locality. In the highest level, there is a large but long-latency global memory space.
2.1.2 ATI Stream
ATI Stream is a parallel GPU architecture designed for GPGPU computations, formerly developed by the ATI company, which was acquired by AMD in 2006. The series of graphics cards are branded as ATI Radeon, ATI FirePro, and AMD FireStream. Also, low-power cards for mobile devices are developed, such as the ATI Mobility Radeon. The hardware implementation of stream processing units is architecturally different from NVidia's stream processors, allowing a high number of stream processing units per core, an L1 cache, and increased double precision performance. However, the cards offer lower memory bandwidth and processor clocking, so the overall performance is comparable with the competitor's. GPGPU developers are supplied with the ATI Stream SDK, allowing them to develop applications in the high-level OpenCL language, described below [ATI10b]. Figure 5 illustrates the relationship of the computing components in the ATI Stream architecture.
2.1.3 OpenCL
OpenCL is a framework for heterogeneous platforms consisting of CPUs, GPUs, and other architectures. It is an open standard developed by the Khronos Group [Ope10a]. It includes a C-based cross-platform language, a platform layer API, a runtime API, and scientific libraries. In the context of GPU programming, it also supports interoperability with the OpenGL standard. Analogous to CUDA, the basic execution units are called kernels. They can be either data-parallel, allowing a fine-grained parallelization suitable for GPUs, or task-parallel, providing a coarse-grained parallelization for CPUs. A memory model is defined in levels of private, local, constant, and global memory space [How10] (Figure 6). OpenCL is supported by both mainstream GPU vendors; however, studies such as that by Kong et al. [KDY+10] revealed that with NVidia graphics cards, CUDA implementations generally achieve better performance than their OpenCL equivalents.
2.1.4 Other architectures
Although graphics cards are probably the most widely used devices nowadays, other hardware architectures can be exploited for massively parallel computing as well. Besides large, expensive CPU clusters, field-programmable gate arrays (FPGA) are popular for their configurability and compact one-chip design, allowing them to be employed in low-power autonomous devices [BR96]. Since the FPGA architecture is much less accessible than GPU and requires a low-level "close-to-metal" approach, we have decided not to deal with FPGAs in our thesis topic. However, it can be mentioned that FPGAs can also be used in image processing [LDZZ09] [YNAS09]. Also, research in adapting the CUDA programming model to FPGA is being conducted [PGS+09], building a bridge between GPGPU development and FPGA design.

Figure 5: ATI Stream software environment [ATI10a].

Figure 6: Memory hierarchy in the OpenCL programming model [ATI10a].
2.2 General-Purpose Computation on GPU

GPGPU is the technique of mapping general-purpose algorithms to graphics hardware, exploiting its high computation and data throughput. The performance of graphics cards has been employed in many applications, including traffic simulation [CBM+08], molecular dynamics [Joh07], seismic modelling [KEGM10], n-body simulation [Nyl07], and image processing [CDMS+08] [PSL+10] (Figure 7). For further information, the website GPGPU.org [GPG] can be recommended as an up-to-date source for both developers and researchers, including notifications of recent papers. The OpenVIDIA framework [FR] is another project focused on the topic, specializing in OpenGL, Cg, and CUDA implementations.
One of the first methodical surveys on GPGPU was made by Owens et al. [OLG+05]. Since it was published in 2005, it does not cover recent research based on popular high-level programming languages such as CUDA or OpenCL. Not all the information, however, became out-of-date, as some paradigms in parallel computing remained unchanged. A streaming model, as an abstraction for GPU programming, was considered in the survey. Several fundamental operations were introduced, such as map, reduce, scatter and gather, stream filtering, sort, and search.
A map operation applies a given function to every element in a given data stream. Computing a smaller output data stream from a larger input stream is called a reduction. Scatter and gather are two fundamental memory operations, representing indirect write and indirect read memory accesses, respectively. Stream filtering provides the ability to select a subset of elements from a stream. Sort and search, two well-known operations, need very careful attention to be implemented efficiently on GPU.
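These fundamental operations have simple sequential reference semantics, sketched below (our notation; on a GPU, each loop iteration would be carried out by one thread):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// map: apply a function to every element of a stream.
std::vector<float> mapStream(const std::vector<float>& s,
                             const std::function<float(float)>& f) {
    std::vector<float> out(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) out[i] = f(s[i]);
    return out;
}

// reduce: combine a larger stream into a smaller result (here, one sum).
float reduceSum(const std::vector<float>& s) {
    float acc = 0.0f;
    for (float v : s) acc += v;
    return acc;
}

// gather: indirect read, out[i] = s[idx[i]].
std::vector<float> gather(const std::vector<float>& s,
                          const std::vector<std::size_t>& idx) {
    std::vector<float> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) out[i] = s[idx[i]];
    return out;
}

// scatter: indirect write, out[idx[i]] = s[i].
void scatter(const std::vector<float>& s, const std::vector<std::size_t>& idx,
             std::vector<float>& out) {
    for (std::size_t i = 0; i < s.size(); ++i) out[idx[i]] = s[i];
}
```

Note that a parallel scatter, unlike a gather, can produce write conflicts when two indices coincide, which is one reason these two operations differ in cost on GPU hardware.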
The survey also discussed several data structures that can be used in GPGPU, uncovering some constraints imposed by the GPU on non-indexable structures, such as priority queues, sets, and k-d trees. Many results of GPU acceleration were cited, showing the considerable effort that had already been made in GPGPU using high-level shading languages. A revised version of the survey was published in 2007 [OLG+07].
Many performance studies comparing GPGPU and traditional CPUs have been published. A paper by Gong et al. [GLG05] provided a comparison between a CPU (Intel Pentium 4) and two commercially successful GPU platforms (ATI Radeon 9800 and NVidia GeForce 5950) in a number of real-time image processing applications. The authors used three different shading languages (HLSL, Cg, and GLSL) and showed that both GPU platforms are competitive, showing significant speedups (up to 36× in the case of ATI and up to 22× in the case of NVidia) over the CPU.
A study by Che et al. [CBM+08] examined several common, computationally demanding applications, including traffic simulation, thermal simulation, and the k-means algorithm. The speedup achieved by GPU implementations using CUDA was between 6× and 40×, compared to CPU implementations, using common commercially available hardware.
(a) Fast and stable rendering of water surfaces [KW03].
(b) A brain cortex segmentation [CLW04].
(c) Calculating electrostatic potential maps in molecular dynamics [Joh07].
(d) Feature extraction [PSL+10].
(e) A simulation of a galaxy as an N-body system [Nyl07].
(f) A 3D reconstruction of a mouse placenta [RUCH09].
Figure 7: Applications in GPGPU.
Ryoo et al. [RRB+08] discussed NVidia GPUs' key features and optimization strategies. They studied GPU kernels, implemented in CUDA, for a dozen various applications, including an H.264 video encoder, the Lattice-Boltzmann method, and MRI image reconstruction. They claimed tremendous speedups comparing the NVidia GeForce 8800 GTX to a 2.2 GHz Opteron 248 (a single-core CPU), even using SIMD optimizations for the CPU implementations. For example, MRI reconstruction algorithms achieved speedup rates over 100×, including the data transfer overhead. Most of the kernels attained speedups over 10×; actually, only two of them performed similarly on CPU and GPU, in particular the H.264 video encoder and an FDTD electromagnetic wave propagation simulator. The basic optimization principles described in the paper are maximizing the ratio of floating-point operations, hiding global memory latency by high streaming processor occupancy, and balancing the usage of global memory, shared memory, and registers, in order to achieve optimal memory bandwidth.
Another survey of many different CUDA implementations was published by Garland et al. [GLGN+08]. The topics of interest, in this case, were molecular dynamics, numerical linear algebra, medical imaging, fluid dynamics, and seismic imaging. The most impressive results were attained in molecular dynamics. In this application, parallelization was naturally provided by mapping each atom to a single thread. A CUDA application called Highly Optimized Object-Oriented Molecular Dynamics (HOOMD) was introduced and compared to another parallel software package called LAMMPS, showing that the former, executed on a single GeForce 8800 GPU, performed similarly to the latter executed on a cluster of 32 CPU cores. Another comparison was made using the popular application Folding@Home; in this case, the GeForce 8800 outperformed a dual-core CPU by a rate of 100×. However, the results in other applications were not that impressive. For example, in linear algebra, an SGEMM dense matrix-matrix multiplication kernel by Volkov and Demmel was compared to Intel's Math Kernel Library (MKL). Testing the graphics card mentioned above and an Intel Core2 Quad processor, the GPU implementation performed 3× faster. The same results were achieved for matrix factorization problems, in particular LU, Cholesky, and QR factorizations.
In contrast with many previous studies, a recent paper by Lee et al. from Intel Corporation [LKC+10] shows that with careful optimization of the CPU implementation, the CPU can still be successful in many applications. They performed a comparison of 14 various implementations, showing speedups from 0.5× to 15× (GPU over CPU), which is significantly less than in many previous studies. Their experiment was made with an Intel Core i7 and an NVidia GTX280.
The paper presents not only the results of the comparison, but also an examination of the results, emphasising the pros and cons of both architectures. The GPU benefits from higher global memory bandwidth and floating-point operation performance, hardware scatter-gather support, acceleration of transcendental (e.g. exponential and trigonometric) functions, and a fast texture lookup capability. As a result, kernels such as the bilateral filter, the GJK collision detector, SAXPY (the basic linear algebra task ax + y), and the Lattice Boltzmann method are accelerated by rates of 5× and higher on GPU. On the other hand, the weaknesses of the GPU are its non-cached global memory (which no longer holds for the new FERMI architecture), reduction overhead, and synchronization overhead. Thus, radix sort and the constraint solver perform better on CPU than on GPU; and kernels such as Monte Carlo or tree index search perform similarly on both architectures.
It can be pointed out that the results might change slightly using more recent graphics hardware, since NVidia had already released the FERMI architecture by that time. Also, more impressive results can be achieved using multiple graphics cards in parallel. Nevertheless, it has been shown that programmers indeed should be careful when judging their CPU and GPU implementations, and that the many tremendous speedups of 100× and more declared in GPGPU should be treated with caution.
2.3 Image Processing Acceleration
Image processing is a wide field, including many various applications. It involves many specific algorithms. Among others, these are the Fourier, Hough, wavelet, and Radon transforms [GW02] [SM06]; matrix multiplication, linear equation solvers, and principal component analysis [Pra01] [Jäh05]; partial differential equation (PDE) solvers [Kim05]; sorting and searching algorithms [Pra01] [GW02]; statistical methods [Fra04]; morphological operations [Ser83] [Mar96]; and graph algorithms. Many of these methods are computationally intensive and suitable for parallelization [BFRR01].
Castaño-Díez et al. [CDMS+08] presented a detailed performance evaluation of image processing algorithms, with respect to their GPU implementations. They tested a number of particular functions frequently used in image processing, such as the fast Fourier transform, matrix multiplication, and several linear and nonlinear filters. They also implemented more complex image processing algorithms, such as weighted back projection, used for 3D image reconstruction, and principal component analysis, used for classification of large data sets. The GPU implementations were written in CUDA and the Cg shading language and claimed speedup rates of at least 10× in most cases, comparing an Intel Dual Core CPU and an NVidia Quadro FX 5600.
Several GPU computing toolboxes for MATLAB have been developed, such as Jacket [AE10] and Parallel Colt [WN]. Kong et al. [KDY+10] developed an open-source GPU library for the MATLAB Image Processing Toolbox. Mapping 12 various functions from the IPT to the GPU, using both the CUDA and OpenCL parallel models, they studied which algorithm and hardware features influence the overall performance. Based on algorithm characteristics, they divided the functions into four categories: data independent (e.g. image intensity adjustment), data sharing (edge detection), algorithm dependent (Euclidean distance transform), and data dependent (dither). The functions of the first category map naturally onto a GPU and achieve high kernel performance. They are usually limited by memory bandwidth and data transfer overhead. The data sharing algorithms are excellent candidates for massive parallelization. For each image element, information about its neighbours is needed; therefore, careful optimization is required in order to hide global memory latency by eliminating redundant global memory reads. With such optimization strategies, tremendous speedups can be achieved. The algorithm-dependent functions often require fundamental revision, with respect to architecture differences. Still, they are good candidates for GPU acceleration. The dither function, falling into the last category, performed poorly on GPU, due to its large fraction of sequential processing, large synchronization overhead, and low ratio of computations to memory accesses.
The imaging process can be divided into several phases. First, an image is acquired by a device, digitized, and reconstructed. This is called image formation. Then, preprocessing is made, applying various image enhancement techniques. This includes tone mapping, image restoration, and geometrical methods. Finally, image analysis is performed, extracting the desired information from the image by means of morphological processing, edge detection, image segmentation, feature extraction, shape analysis, and image registration [Pra01]. In the following text, we will discuss particular image processing methods and their GPU implementations.
2.4 Image Formation
Basically, most imaging techniques use a projection from 3D space (the real world) into 2D space (the imaging device). In many applications, these two-dimensional images are used for further processing, without special requirements for image formation. In other applications, such as magnetic resonance imaging (MRI), computed tomography (CT), and 4Pi confocal microscopy [CC74] [HS92], a number of acquired images is used for reconstructing a complex 3D image. This leads to advanced approaches to image formation.
In [XM05], Xu and Mueller implemented three 3D reconstruction algorithms for CT, namely Feldkamp filtered backprojection, simultaneous algebraic reconstruction (SART), and expectation maximization (EM), on graphics hardware. From 80 projections, they obtained reconstructions of 128³ voxels. The quality of the GPU results was comparable to those obtained from the CPU, achieving approximately 10× speedup. In [XM09], Xu and Mueller introduced a GPU implementation of the OS-SIRT iterative reconstruction algorithm (Figure 8). Furthermore, they implemented bilateral filters for efficient regularization of the reconstructed results. For the latter operation, they achieved tremendous speedups of more than 100×, using the OpenGL Shading Language with an NVidia 8800 GT graphics card. They showed that the OS-SIRT method outperforms SART for data with a high level of noise.
Castaño-Díez et al. [CDMS+08] presented three different implementations of SART, two of them using CUDA and one with a texture-based approach using Cg. All the implementations achieved approximately 50× speedup compared to their CPU equivalents. The Cg implementation performed the best; however, writing and debugging it required considerably more effort than in CUDA, as noted by the authors. The GPU-accelerated weighted backprojection, used for 3D reconstruction from 2D electron micrographs, was tested as well. This included low-pass filtering of the acquired images, rotation, Fourier space weighting, and backprojection into a 3D image. They attained speedups of up to 10× on graphics hardware.
Figure 8: The iterative tomographic reconstruction of medical images with bilateral filtering [XM09].

Stone et al. [SHT+08] accelerated an advanced MRI reconstruction algorithm, with respect to optimal reconstruction from non-Cartesian scan trajectories. They implemented the anatomically constrained reconstruction algorithm by Haldar et al. [Hal08]. The key part of the algorithm was the conjugate gradient linear solver, designed to perform an iterative matrix inversion. As the number of elements in the matrices was equal to the square of the number of voxels in the reconstructed image, the matrices contained trillions of complex-valued elements for images of 128³ voxels. Fortunately, the matrices had either a sparse or a convolutional structure, allowing programmers to handle them in a compressed format or in the Fourier domain, respectively. The authors tuned their algorithm, comparing seven versions with various optimizations. Their best GPU implementation was able to compute the MRI reconstruction in less than two minutes, while the CPU implementation completed in 23 minutes. Using four NVidia Quadro FX 5600 graphics cards, a multi-GPU implementation was tested as well, reducing the overall time to less than a minute.
In ultrasound imaging, inverse-scattering algorithms are used for generating 3D volumetric images of a breast. These methods are time consuming; however, specialists would like to discuss the results with their patients within the same visit. In Garland et al. [GLGN+08], GPU implementations of particular procedures used in these algorithms were discussed, especially 2D convolution by fast Fourier transform and the CAXPBY routine, which computes several basic operations from linear algebra. During the algorithm's run, over 50 million 2D FFTs needed to be computed, accounting for over 75 percent of the computation time. By implementing a batched 2D FFT kernel, Garland et al. made the GPU convolution approximately 16 times faster than the CPU implementation.
An iterative tomographic reconstruction was studied by Jang et al. [JKDP09]. Both the forward and backward projection steps were implemented using CUDA on an NVidia Tesla S870. This graphics hardware consists of four independent Tesla C870 GPUs; therefore, a multi-GPU approach was examined. For the forward projection, they achieved 35× speedup with a single GPU and 70× speedup with all four GPU cores, compared to an Intel Core 2 Duo CPU. For the back projection, the GPU performed even better. The multi-GPU performance was significantly degraded due to CPU thread synchronization overhead, as each GPU core had to be driven by an individual CPU thread. However, this could be caused by the operating system's thread scheduling policy, as stated by the authors. Using single floating-point precision, the results showed negligible differences between the CPU and GPU reconstructions.

Figure 9: Examples of HDR images processed by tone mapping [GWWH05].
2.5 Image Enhancement
2.5.1 Tone Mapping
Tone mapping is the process of mapping pixel intensity values with a high dynamic range (HDR) to a smaller dynamic range for the purposes of visualization. As recent display technologies still provide a limited dynamic range, tone mapping is an important part of image visualization. It also includes brightness and contrast enhancement. In this context, simple histogram-based adjustments can also be considered a part of tone mapping. As these methods involve a large number of pointwise operations, they can be parallelized on GPU very naturally; however, the speedup may be limited by the global knowledge required (see Introduction) in a particular algorithm [BFRR01, OLG+05].
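The pointwise structure of these methods can be illustrated with the simplest global operator, Ld = L/(1+L) (the basic form of the Reinhard operator [RSSF02]): every output pixel depends only on the corresponding input pixel, which is exactly what makes a one-thread-per-pixel GPU mapping natural. The following is a minimal CPU sketch in Python for illustration only; it is not taken from any of the cited implementations.

```python
def tone_map(luminance):
    """Global tone mapping: compress HDR luminance into [0, 1).

    Applies the operator Ld = L / (1 + L) independently to every
    pixel -- exactly the pointwise structure that maps to one GPU
    thread per pixel.
    """
    return [[L / (1.0 + L) for L in row] for row in luminance]

hdr = [[0.01, 1.0, 250.0],
       [4.0, 90.0, 1e4]]
ldr = tone_map(hdr)
# Every output value lies in [0, 1); the ordering of intensities is preserved.
```

Global knowledge (e.g. the image-wide key value used by the full Reinhard operator) is what breaks this perfect independence and limits the speedup, as noted above.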
Drago et al. [DMAC03] designed a fast HDR tone-mapping algorithm and achieved a modest speedup, comparing a C implementation on a Pentium 4 and an OpenGL implementation on an ATI Fire GL X1. They tested their algorithm on common real images, such as a photograph of the Stanford memorial church. Another method, tested on digital photographs, was published by Goodnight et al. [GWWH05] (Figure 9). They implemented the well-known algorithm of Reinhard et al. [RSSF02] and reported approximately 10× speedup using an OpenGL implementation, on an AMD Athlon 1800+ CPU with an ATI Radeon 9800 Pro graphics card. Raspe and Müller presented an interactive tone-mapping framework for medical volume data [RM07]. They built the cross-platform programming environment called "Cascada" using the OpenGL Shading Language (GLSL). Their results revealed tremendous speedup factors of up to 250×, tested on an Intel Core 2 Duo and an NVidia GeForce 8800 GTS. However, the performance was significantly degraded by data transfer overhead. Therefore, the algorithms worked best when avoiding data transfers. For example, the user interface for tuning tone-mapping parameters maintained real-time performance.
2.5.2 Image Restoration
Image restoration aims to remove undesired artifacts from images, including blurring introduced by an optical system, motion blur, and noise from electronic sources. Thus, it is often an integral part of image formation. Removing blur leads to deconvolution techniques; eliminating noise usually involves linear and nonlinear filters. However, for real images, the use of advanced approaches that take both types of degradation into account is inevitable.
Gong et al. [GLG05] implemented a number of simple image filters, including the mean, Gaussian, and median filter, using the HLSL shading language. They compared two graphics cards (ATI Radeon 9800 XT and NVidia GeForce FX 5950 Ultra) with an Intel Pentium 4 processor. For linear filters, they achieved approximately 4× speedup with the ATI GPU and almost no speedup with the NVidia GPU. For nonlinear filters, such as the median, the ATI card attained the same processing times as the CPU, and the NVidia card was 6× slower. It was shown that nonlinear filters indeed perform worse than linear filters, as they involve many comparison operations, which are not very suitable for parallelization on graphics hardware. In this work, only very small 3×3 filter kernels were discussed. Much larger kernels would probably give a more accurate comparison, eliminating the effect of overheads.
In Castaño-Díez et al. [CDMS+08], both linear and median filters were accelerated, showing good performance, especially for large 3D images. For small images, such as 128×128 pixels, only a modest speedup was attained. This could be expected, due to various overheads; however, the authors did not comment on this issue in detail. Kong et al. [KDY+10] implemented the MATLAB function conv2, which provides convolution of 2D images. With image sizes of over a million pixels, the achieved kernel speedups were in the order of hundreds; however, the overall speedups were approximately 10× when data transfers were taken into account. A number of linear filter implementations can be found in the CUDA SDK [CUD10d].
Among advanced restoration methods, the Richardson-Lucy deconvolution algorithm [RIC72, Luc74] is one of the best known and most widely used. Domanski et al. [DVW09] compared a CPU implementation, using the FFTW library [FJ09], and a GPU implementation, using the CUDA framework and the CUFFT library [CUD10b]. Two approaches were discussed: deconvolution in the spatial domain and in the Fourier domain. Even for relatively small filter kernel sizes, the FFT-based approach outperformed the spatial-domain approach. Comparing the CPU and GPU platforms, i.e. the NVidia GTX 260 graphics card and the Intel Xeon Quad-Core processor, the GPU implementation was 8× faster for 2D images and 4× faster for 3D images.
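For reference, one spatial-domain Richardson-Lucy iteration multiplies the current estimate by a correction term derived from the ratio of the measured data to the reblurred estimate. The sketch below is a minimal 1D illustration with circular convolution (the FFT-based variant discussed above replaces both convolutions with Fourier-domain products); the function names are ours, not taken from the cited implementations.

```python
def cconv(x, k):
    """Circular convolution: y[i] = sum_j x[(i - j) % n] * k[j]."""
    n = len(x)
    return [sum(x[(i - j) % n] * k[j] for j in range(len(k)))
            for i in range(n)]

def ccorr(x, k):
    """Circular correlation -- the adjoint of cconv for the same kernel."""
    n = len(x)
    return [sum(x[(i + j) % n] * k[j] for j in range(len(k)))
            for i in range(n)]

def rl_step(estimate, data, psf):
    """One Richardson-Lucy update:
        e <- e * ccorr(data / cconv(e, psf), psf).
    With a normalised PSF, the total flux sum(e) stays equal to
    sum(data), and non-negativity is preserved."""
    blurred = cconv(estimate, psf)
    ratio = [d / b for d, b in zip(data, blurred)]
    correction = ccorr(ratio, psf)
    return [e * c for e, c in zip(estimate, correction)]

# Toy problem: blur a spike with a normalised 3-tap PSF, then deconvolve.
psf = [0.2, 0.6, 0.2]
truth = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0]
data = cconv(truth, psf)
estimate = [sum(data) / len(data)] * len(data)   # flat initial guess
for _ in range(50):
    estimate = rl_step(estimate, data, psf)
# The estimate re-concentrates at the spike position (index 2).
```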
Quammen et al. [QFTI09] presented a performance study of three 3D deconvolution algorithms, namely Wiener [Wie64], Jansson-van Cittert [SN06], and Richardson-Lucy, on both multi-core CPUs and GPUs. As the FFT-based approach was preferred, both the FFTW and the CUFFT libraries were employed. The scalability of the algorithms on CPU was studied, measuring performance on various numbers of CPU cores, ranging from 1 to 16. The results revealed that the CPU platform, consisting of 8 dual-core AMD Opteron nodes connected by a HyperTransport network, performed best in the case of 8 CPU cores. For a larger number of cores, a sharp drop-off in scalability was revealed, probably caused by some implementation features of the FFTW library. Then, the 8-core CPU and a single GPU, an NVidia Quadro FX 5600 in particular, were compared. The CPU performed 2× faster for the Wiener filter, and approximately 1.5× faster for the other filters. As the time complexity of all the algorithms strongly depends on FFT performance, other FFT libraries for GPU should be considered, such as [NOEM08] and [GLD+08]. For further information, refer to Section 2.6.6.

Figure 10: Reconstruction of an astronomical image [SZZ10]: (a) original image, (b) blurred and noisy image, (c) reconstructed image.
The expectation maximization (EM) method, as described in [SV82], scaled gradient projection (SGP) [BZZ09], and total variation (TV) regularization [ROF92] are examples of optimization techniques used for edge-preserving removal of noise. Serafini et al. [SZZ10] proposed a GPU version of both SGP and EM using the CUDA framework, and the CUFFT [CUD10b] and CUBLAS [CUD10a] libraries (Figure 10). They used double precision in the CPU implementation, tested on an AMD Athlon X2 Dual-Core, and single precision in the GPU implementation, run on an NVidia GTX 280. According to their results, SGP largely outperformed EM in terms of time consumption, obtaining the same image quality. The speedup achieved by the GPU was approximately 20–30× for both methods. Moazeni et al. [MBS09] described a GPU implementation of TV regularization for denoising of diffusion tensor images (DTI), used in MRI. Their algorithm required less than 2 minutes to process a 3D image of 128³ voxels on an NVidia Quadro FX 5600. The same algorithm needed more than 3 hours to complete on a dual-core CPU.
2.5.3 Geometrical Methods
Geometrical modification of images is a very common image processing operation, including basic transformations, such as translation, rotation, and scaling; and advanced procedures, such as nonlinear warping and perspective transformation. These algorithms are very naturally accelerated on graphics hardware, since they actually represent one of the most typical tasks for GPUs. Geometrical methods are often an integral part of image formation and image registration, such as in [SKSF07] and [MOOXS08]. For further information on these topics, refer to Sections 2.4 and 2.6.6.
In one of the early works, Ren et al. [RPZ02] described hardware acceleration of the elliptical weighted average (EWA) surface splatting method, used for high-quality rendering of point-sampled 3D objects. In the paper, an object space formulation was presented, and a GPU implementation was developed, using DirectX-supported graphics hardware. Kelly and Kokaram [KK04] proposed an algorithm for motion estimation in video sequences. The key part of the algorithm was fast image interpolation, using textures on graphics hardware. Fan et al. [FEK+05] described an interpolation method called natural neighbor interpolation, based on Voronoi tessellation. Strengert et al. [SKE06] applied bilinear texture interpolation, implemented on modern graphics hardware, to accelerate pyramid methods, which are popular especially in fast image processing. Kraus et al. [KES07] proposed an edge-directed image interpolation method for image upsampling, which avoided ringing artifacts and preserved smooth and sharp edges. The graphics hardware was exploited for the computation, using OpenGL. Zhu et al. [ZLB+09] proposed the Spherical Depth Image, a spherical representation of image-based models, adapted to image warping. Real-time performance was achieved by implementing the algorithm on graphics hardware.
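Bilinear interpolation, which GPU texture units evaluate in hardware and which several of the methods above build on, can be sketched in a few lines. The Python code below is an illustrative CPU version; the names and the clamped border handling are our own choices, not taken from the cited papers.

```python
def bilinear_sample(img, x, y):
    """Sample the image at a non-integer (x, y) position by bilinear
    interpolation -- the operation that GPU texture units perform in
    hardware.  img is a list of rows; (x, y) must lie inside the image
    (neighbours past the border are clamped to the last pixel)."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(img[0]) - 1)
    y1 = min(y0 + 1, len(img) - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

img = [[0.0, 10.0],
       [20.0, 30.0]]
# Centre of the 2x2 patch: the average of the four corners, 15.0.
value = bilinear_sample(img, 0.5, 0.5)
```

An inverse-mapped geometric transform (rotation, warping) simply evaluates such a sample at every output pixel, which is why these methods parallelize so naturally.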
Terrain and surface rendering is a typical application of geometrical methods. Botsch and Kobbelt [BK03] proposed a fast framework for point-based geometry. The algorithm was implemented on the NVidia GeForceFX 5800 Ultra graphics card, using OpenGL. Splat filtering was used for smoothing textures, achieving more eye-appealing results. Dachsbacher and Stamminger [DS04] presented a view-dependent level-of-detail approach for interactive terrain rendering in real time. During rendering, the addition of procedural geometric and texture detail was accelerated by graphics hardware. Schneider et al. [SBW06] described a method for the creation and rendering of realistic landscapes. A real-time fractal terrain synthesizer, combined with a WYSIWYG user interface, was built and implemented on an NVidia GeForce 7800 GTX using OpenGL.
Levin et al. [LDS05] implemented the thin plate spline (TPS) warping algorithm [Duc76] using GLSL. Warping was applied to CT-based volumes of 512×512×173 voxels, using 92 user-defined displacement vectors. Xiao et al. [XCZ09] proposed a bilinear warping method for 2D CT reconstruction, accelerated on graphics hardware. Using OpenGL on an HP W4000 PC workstation with a 3DLabs Wildcat 6110 graphics card, the GPU implementation achieved 20× speedup for the forward-projection stage and 15× speedup for the whole reconstruction.
2.6 Image Analysis
2.6.1 Morphological Processing
Mathematical morphology is based on the analysis and processing of spatial structures in images. The basic concepts of this theory trace back to set theory, integral geometry, and lattice algebra. Morphological methods can be used both for image preprocessing and for image analysis. For example, opening and closing can be adopted for smoothing the contours of objects or eliminating small holes in objects. In other applications, image skeletonizing can be used to describe the structure of objects. The watershed transform is one of the well-known approaches to image segmentation. For information about GPU-based watershed implementations, refer to Section 2.6.3.
Dilation and erosion are two fundamental morphological operations. Due to their nature, they can be very simply implemented on graphics hardware, as presented in [HB08] or [KDY+10]. Another GPU implementation of dilation, erosion, and closing can be found in [ESA05]. In this paper, the application of morphological methods to preprocessing in medical imaging was studied. Sugano and Miyamoto [SM08] proposed a parallel implementation of several morphological operations; however, they did not use graphics hardware, but the high-performance Cell Broadband Engine processor. Dortmont et al. [vDvdWT06] implemented a method for computing distance transforms and skeletons of 3D objects on graphics hardware. Rong and Tan [RT06] proposed a fast GPU-based jump flooding algorithm to compute approximations of the Voronoi diagram and the distance transform. Cao et al. [CTMT10] described a CUDA implementation of a parallel banding algorithm for computing the exact Euclidean distance transform (EDT), which can be further used for performing such morphological operations as dilation, erosion, opening, and closing.
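To make the per-pixel independence of dilation and erosion concrete, the sketch below implements both (for greyscale images and a square structuring element) as neighbourhood maximum and minimum filters. This is an illustrative CPU version in Python, not code from the cited papers; border pixels simply use a clipped neighbourhood.

```python
def dilate(img, r=1):
    """Greyscale dilation with a square structuring element: each pixel
    becomes the maximum over its (2r+1) x (2r+1) neighbourhood (the
    neighbourhood is clipped at the image border).  Every output pixel
    is independent -- one GPU thread per pixel."""
    h, w = len(img), len(img[0])
    return [[max(img[ny][nx]
                 for ny in range(max(0, y - r), min(h, y + r + 1))
                 for nx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]

def erode(img, r=1):
    """Greyscale erosion, expressed through its duality with dilation:
    erode(img) == -dilate(-img)."""
    negated = [[-v for v in row] for row in img]
    return [[-v for v in row] for row in dilate(negated, r)]

img = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]
# Dilation spreads the bright pixel over the whole 3x3 image;
# erosion of the original removes it entirely.
```

Opening and closing then follow as erode-then-dilate and dilate-then-erode, respectively.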
2.6.2 Edge Detection
Edges, i.e. changes or discontinuities in image luminance or color values, represent important regions in an image. According to their character, edges are often detected by derivative-based operators, for example, first-order derivative methods (based on gradient or directional derivatives) and second-order derivative methods (e.g. the Laplacian operator). Edge detectors can often be implemented by means of convolution, either by a basic convolution or by a separable convolution, as many edge-detecting filters have a separable kernel. In the case of large non-separable kernels, a convolution in the Fourier domain can be performed, as described in [FČ06]. A CUDA implementation of convolution methods was described by Podlozhnyuk [Pod07]. Other approaches to edge detection are based on mathematical morphology [YQ08] or on active contour models [CV01]. An example GPU implementation of the latter can be found in [SA09].
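As an illustration of why separability matters, the horizontal Sobel kernel factors into a 1D smoothing pass and a 1D derivative pass, reducing the work per pixel. The Python sketch below is our own illustration (zero padding at the borders is an arbitrary choice), not code from the cited implementations.

```python
def conv_rows(img, k):
    """Correlate every row with the 1D kernel k (zero padding at borders)."""
    r = len(k) // 2
    w = len(img[0])
    return [[sum(row[x + j - r] * k[j]
                 for j in range(len(k)) if 0 <= x + j - r < w)
             for x in range(w)]
            for row in img]

def conv_cols(img, k):
    """Correlate every column with k, by transposing twice."""
    transposed = [list(col) for col in zip(*img)]
    return [list(col) for col in zip(*conv_rows(transposed, k))]

def sobel_x(img):
    """Horizontal Sobel response via the separable factorisation
        [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]] = [1, 2, 1]^T x [-1, 0, 1]:
    one smoothing pass along columns and one derivative pass along
    rows -- 6 multiplies per pixel instead of 9."""
    return conv_rows(conv_cols(img, [1, 2, 1]), [-1, 0, 1])

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 8, 7, 6],
       [5, 4, 3, 2]]
gx = sobel_x(img)   # positive where intensity increases to the right
```

For a k×k separable kernel the saving grows from k² to 2k multiplies per pixel, which is why GPU convolution implementations such as [Pod07] treat the separable case specially.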
Canny is one of the most widely used edge detectors, based on first-order derivatives. A simple CUDA implementation of the Canny operator is described in [Fun05]. OpenGL and Cg versions were presented by Ruiz et al. [RUG07]. A more detailed study of the Canny operator was presented by Luo and Duraiswami [LD08]. The algorithm was implemented in CUDA and subsequently integrated into MATLAB, achieving approximately 3× speedup over an optimized CPU implementation in the OpenCV library, tested on an Intel Core 2 Duo 6600 and an NVidia GeForce 8800 GTX. Kong et al. [KDY+10] presented a GPU-accelerated procedure for MATLAB, performing Prewitt edge detection. They reported more than 1000× kernel speedup over the edge function implemented in the MATLAB IPT, without taking data transfers into account.

Edge-detecting algorithms can also be included in image segmentation or feature extraction methods. For further information, please refer to Sections 2.6.3 and 2.6.4.
Figure 11: 3D segmentation of fishes in acoustic underwater images [SA09].
2.6.3 Image Segmentation
Image segmentation aims to recognize objects in images. There are many approaches to image segmentation, namely pixel-based (e.g. thresholding), edge-based (derivative-based approaches), region-based (split-and-merge and region-growing), and model-based (the Hough transform). Segmentation by global optimization leads to more complex regularization techniques [Jäh05]. Mathematical morphology introduces a specific category of segmentation algorithms, for example, the watershed transform [BM91].
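As a concrete example of the pixel-based category, the sketch below selects a global threshold automatically by Otsu's method, which maximizes the between-class variance of the two resulting pixel classes. This is an illustrative Python version of a textbook algorithm, not taken from the cited works.

```python
def otsu_threshold(pixels, levels=256):
    """Pick a global threshold by Otsu's method: choose t maximising
    the between-class variance w0 * w1 * (mu0 - mu1)^2, where w0, w1
    are the class sizes and mu0, mu1 the class means of the pixels
    with value <= t and > t, respectively."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        w1 = total - w0
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t  # pixels <= best_t form the dark class

# Bimodal toy image: dark background around 20, bright objects around 200.
pixels = [18, 20, 22, 19, 21, 198, 200, 202, 199, 201]
t = otsu_threshold(pixels)
```

The histogram accumulation is the only global step; the thresholding itself is again perfectly pointwise.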
Many examples of early work on GPU-based segmentation methods can be found in the comprehensive studies by Hadwiger et al. [HLSB04] and Owens et al. [OLG+05]. We will briefly mention some of them. Diewald et al. [DPRS01] presented an early GPU implementation of anisotropic diffusion models. Sherbondy et al. [SHN03] described a GPU implementation of isotropic diffusion based on the Perona-Malik model, with an application to 3D volume medical data. Lefohn et al. [LCW03] used a streaming approach on a GPU to accelerate a level set method. They proposed an efficient memory management scheme using 2D textures and achieved a 15× speedup. Furthermore, in [CLW04], they presented a new tool, called GIST, for interactive GPU-based 3D segmentation.
Sharma and Anton [SA09] proposed a level set method for the 3D reconstruction of acoustic underwater images (Figure 11). Their approach was based on level set evolution using the Mumford-Shah segmentation functional, with a special view to noise suppression. They implemented a number of procedures in CUDA, including the signed distance transform, a weighted median filter, and a PDE solver. As the acoustic images from marine surveys can often result in gigabytes of data, the algorithm was optimized in terms of memory usage. Due to the large resolution, the images were divided into small volumes of 128³ voxels, stored in 3D textures, and processed independently. They were able to complete the computation in one minute, using a mobile GPU, namely the NVidia GeForce 8600M GT.
An interesting application of segmentation can be found in Zitnick et al. [ZKU+04]. In this paper, a method for the interactive control of viewpoint in video sequences was proposed, combining synchronized streams from multiple cameras. As a result, the user could replay video from a new viewpoint, continuously change the point of view in dramas, and create "freeze frame" visual effects. Their segmentation method was based on color matching. To improve the results, a variant of Perona and Malik anisotropic diffusion was used to reduce noise in the images. Their GPU implementation showed real-time performance for data stored in the graphics hardware's memory.
The watershed transform views a greyscale image as a relief flooded by water. Once particular flooded regions, called catchment basins, merge, the lines of merging are marked as watershed lines, i.e. the segmentation lines. The major issue with the standard watershed is over-segmentation; therefore, several advanced approaches were designed to address the problem. As the watershed is a global operation, it is fairly challenging to implement on graphics hardware. Several implementations can be found, such as [PGX08] and [KP08]. In the former work, the multi-degree immersing watershed algorithm was written in CUDA. Comparing an NVidia GeForce 8500 GT graphics card with a Pentium 4, more than a twofold speedup was achieved. In the latter work, a cellular automaton was designed to implement a number of graph algorithms to perform the watershed transform. Compared with a toboggan-based watershed implemented on CPU, the GPU implementation also attained more than a twofold speedup, using an Intel Xeon and an ATI Radeon X1950 Pro.
Graph algorithms can be used not only in the watershed transform, but also in graph-cut approaches. These methods are successful competitors to level sets. Graph cuts can be used, e.g., for Mumford-Shah functional computation [GA09]. They can perform fast; however, they require a lot of memory. Therefore, they are difficult to apply to large images, especially when implemented on GPU. A CUDA implementation, achieving a 10× speedup over a CPU implementation, can be found in [VN08]. Harish et al. [HN07] implemented several fundamental graph algorithms in CUDA, including breadth-first search, single-source shortest path, and all-pairs shortest path. They achieved a significant speedup (up to 50×) for artificial graphs; however, for real-world graphs with a small per-vertex degree, the performance of the GPU was comparable to the CPU. Barnat et al. [BBBČ10] proposed a CUDA implementation for computing the strongly connected components of a directed graph.
2.6.4 Feature Extraction
An image feature is a representative characteristic or attribute of an image. The goal of feature extraction is to transform input data, which can be very large, into a reduced set of information, which is used for subsequent analysis and classification. Amplitude features are used for extracting the information contained in image amplitude, i.e. luminance, spectral value, or other units. Measures describing an image histogram, such as the mean, standard deviation, skewness, or energy, often aid in amplitude feature extraction. Image transforms, for example the Fourier, Haar, and Hough transforms, can be employed to detect shape-based features. Texture-based approaches include edge detectors, autocorrelation methods, Gabor filters, dependency matrix methods, singular-value decomposition (SVD), and many others [Pra01]. For the classification of image data, principal component analysis (PCA) and related techniques are used.
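The histogram-based amplitude features mentioned above are simple to state precisely. The Python sketch below computes the mean, standard deviation, skewness, and histogram energy of a list of pixel values; it is an illustration of the standard definitions only (the function name and normalisation conventions are ours).

```python
def amplitude_features(pixels):
    """Amplitude features of an image region: mean, (population)
    standard deviation, skewness, and histogram energy.  Energy is
    the sum of squared normalised histogram bins: low for flat
    histograms, exactly 1.0 when all pixels share one value."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    std = var ** 0.5
    skew = (0.0 if std == 0
            else sum((p - mean) ** 3 for p in pixels) / (n * std ** 3))
    hist = {}
    for p in pixels:
        hist[p] = hist.get(p, 0) + 1
    energy = sum((count / n) ** 2 for count in hist.values())
    return {"mean": mean, "std": std, "skewness": skew, "energy": energy}
```

Each feature is a reduction over all pixels, which on a GPU maps to the standard parallel-reduction pattern rather than the one-thread-per-pixel pattern of pointwise operations.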
One of the methods used not only for feature extraction, but also in other applications, is the wavelet transform. Hopf and Ertl [HE00] proposed an early OpenGL implementation of the transform. Measured on a Silicon Graphics Octane with an R10000 195 MHz processor and an MXE graphics pipe, they achieved a significant speedup for large images. Garcia and Shen [GS05] described a Cg implementation of 3D wavelet reconstruction, and evaluated the performance on an Intel Xeon processor and an NVidia Quadro FX 3400 graphics card. For a visible woman volume dataset, the GPU achieved 6× speedup for Haar wavelets and 7× speedup for Daubechies. Another OpenGL implementation of the wavelet transform, applied to wavelet-based geometric design, stylized image processing, texture-illuminance decoupling, and JPEG2000 image encoding, was proposed by Wong et al. [WLHW07].

Figure 12: Feature extraction in a sunflower image [CVG08].
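One level of the Haar transform reduces to pairwise averages and differences, and each pair is independent of all others, which is what makes the transform attractive for parallel hardware. The following unnormalised 1D Python sketch is illustrative only and is not taken from the cited implementations.

```python
def haar_step(signal):
    """One level of the (unnormalised) Haar wavelet transform:
    pairwise averages (low-pass) and differences (high-pass).
    Each output pair depends on exactly one input pair, so all
    pairs can be processed in parallel."""
    half = len(signal) // 2
    approx = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(half)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(half)]
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_step."""
    out = []
    for a, d in zip(approx, detail):
        out += [a + d, a - d]
    return out

signal = [9.0, 7.0, 3.0, 5.0]
approx, detail = haar_step(signal)   # ([8.0, 4.0], [1.0, -1.0])
```

The full multi-level transform simply reapplies haar_step to the approximation coefficients, halving the signal length at each level.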
Fung et al. [FM05] implemented a number of feature detection methods in their OpenVIDIA library. The algorithm was designed to compute unique appearance-based feature vectors and to track features in video sequences. A detailed description of feature vector computation can also be found in [Fun05]. Ruiz et al. [RUG07] accelerated the Hough transform for edge and circle feature detection, using the OpenGL and Cg shading languages. Castaño-Díez et al. [CDMS+08] accelerated the PCA method on graphics hardware, using CUDA. As the algorithm includes a number of linear algebra operations, such as a covariance matrix computation, the CUBLAS library was used to implement such procedures. They achieved up to 10× speedup for the classification of data from electron micrographs.
Cornelis and Van Gool [CVG08] presented a fast CUDA implementation of the feature detection method called Speeded Up Robust Features (SURF) [BTVG06] (Figure 12). With respect to the optimization of data layout in GPU texture memory, a scale-space pyramid creation was proposed. Feature descriptor matching was performed using different criteria, such as the sum of squared differences (SSD) or the dot product (DOT). In tests performed on images of 640×480 pixels, a framerate of over 100 fps was attained on an NVidia GeForce 8800 GTX, and 35 fps on mobile graphics hardware, namely an NVidia Quadro FX 1600M.
Gipp et al. [GMH+09] proposed a GPU-based algorithm for computing Haralick's texture features for biological applications. An optimized approach to co-occurrence matrix generation was introduced, in order to save memory space. The GPU implementation was written in CUDA and ran on the NVidia GeForce 280 GTX graphics card, performing 32× faster than an optimized software version executed on an Intel Core 2 Quad machine. Ujaldon [Uja09] used CUDA to implement the computation of Zernike moments on an NVidia GeForce 8800 GTX, claiming 5× speedup over the best existing CPU implementation.
2.6.5 Shape Analysis
Shape-analysing techniques are useful tools for the classification of objects within an image, including both qualitative and quantitative approaches. Many shape properties can be described, such as topological attributes, size measurements and other geometric attributes, shape orientation descriptors, spatial moments, and Fourier descriptors [Pra01].
In shape analysis, distance transforms, e.g. the Euclidean distance transform (EDT), are often employed for measurements of geometric characteristics. A number of GPU implementations of the EDT can be found in the literature, for example in [RT06], [vDvdWT06], and [CTMT10]. Coeurjolly [Coe10] described several methods of volumetric analysis based on distance transforms. In particular, separable approaches were discussed, in order to design high-performance parallel algorithms.
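The EDT itself is easy to state: for every pixel, the distance to the nearest foreground pixel. The brute-force Python sketch below serves as a readable reference definition; the cited GPU algorithms exist precisely because this quadratic formulation is far too slow for real images.

```python
def edt(mask):
    """Brute-force Euclidean distance transform: for every pixel,
    the distance to the nearest foreground (True) pixel.  Quadratic
    in the pixel count -- a reference definition only; separable and
    banding algorithms such as [CTMT10] compute the same result far
    faster."""
    fg = [(y, x) for y, row in enumerate(mask)
          for x, v in enumerate(row) if v]
    return [[min(((y - fy) ** 2 + (x - fx) ** 2) ** 0.5
                 for fy, fx in fg)
             for x in range(len(row))]
            for y, row in enumerate(mask)]

mask = [[False, False, False],
        [False, True, False],
        [False, False, False]]
dist = edt(mask)
# Centre 0, edge neighbours 1, corners sqrt(2).
```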
Peterson et al. [PDHW10] proposed a particle system for triangular meshes, which can be used in statistical shape analysis applications. Considering a number of optimizations, both naive and tuned CUDA implementations were described. The optimized GPU version of the algorithm attained up to 300× speedup over the CPU version, when evaluated on an NVidia GeForce GTX 260 and an Intel Core 2 Duo. The method was tested on both synthetic and real data, demonstrating similar results.
2.6.6 Image Registration
The geometrical alignment of images is called image registration. It can be used either for fusion, in the case that the images contain the same scene but different information, or for stitching images together, in order to get one large mosaic picture. Basically, there are two groups of image registration methods: area-based and feature-based. Examples of the former category are correlation, mutual information, and optimization methods; in the latter category there are methods using invariant descriptors, relaxation methods, and wavelets [Zit03].
One of the area-based registration methods is based on the gradient flow, where a constant brightness in the image is assumed. Thus, the motion corresponds to brightness changes, and motion vectors can be calculated from the partial space and time derivatives of the image brightness. Strzodka et al. [SDR03, SDR04] implemented the multiscale gradient flow method using DirectX 9 graphics hardware, in particular, an NVidia GeForceFX 5800. They were able to register fully distorted images, with a resolution of 513×513 pixels, within 10 seconds. In [SG04], Strzodka et al. presented a system for real-time motion evaluation and visualization on 640×480, 25 Hz image sequences.

Figure 13: Stitched panoramic view of the Durban Coastline [DDdV10].
Duvenhage et al. [DDdV10] proposed a GPU implementation of the Lucas-Kanade registration algorithm, using GLSL (Figure 13). Their algorithm was applied to provide 3D stabilisation of an ocean-based wide area surveillance system. In their multi-resolution approach, an image pyramid is created and the registration is performed at the lowest resolution first, and subsequently refined at higher resolutions. Dealing with images of 4096×1000 pixels, the implementation used regions of interest (ROI) of 256×256 or 512×512 pixels for image registration. Using an NVidia GeForce GTX 285, real-time performance (30 frames per second) was achieved.
Classical representatives of the area-based methods are correlation methods, based on the Fourier shift theorem. Thus, they can be efficiently computed in the Fourier domain, using the fast Fourier transform; on the other hand, they can be used only in the case of simple transformations, such as shift, rotation, and scaling. Acceleration on GPU is therefore limited by an efficient implementation of the FFT algorithm. The CUFFT library by NVidia provides a framework for computing the FFT on GPU [CUD10b]. However, the speedup is bounded by memory bandwidth and data transfer overhead. Also, since the FFT is sensitive to accuracy, and graphics hardware performs best in single precision, programmers have to pay attention to error rates. Nukada et al. [NOEM08] proposed a custom kernel for 3D FFT, written in CUDA, claiming 3× faster performance compared to CUFFT. However, they assumed image sizes to be powers of two; therefore, the performance for general image sizes is unknown. A more general, well-described implementation was presented by Govindaraju et al. in [GLD+08], achieving 2–4× speedup over the CUFFT library.
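The principle behind correlation-based registration can be shown without an FFT: estimate the shift as the lag that maximizes the cross-correlation of the two images; the shift theorem merely lets the same quantity be computed in O(n log n) via Fourier-domain products. The 1D Python sketch below is illustrative only, restricted to integer circular shifts for simplicity.

```python
def estimate_shift(reference, moved):
    """Estimate the integer circular shift applied to `reference` to
    obtain `moved`, by maximising the circular cross-correlation.
    This brute-force search is the spatial-domain counterpart of
    Fourier-domain correlation: by the shift theorem, the same score
    for all lags can be obtained with FFTs in O(n log n)."""
    n = len(reference)

    def score(s):
        # Correlation of the reference with `moved` shifted back by s.
        return sum(reference[i] * moved[(i + s) % n] for i in range(n))

    return max(range(n), key=score)

signal = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0]
moved = signal[-3:] + signal[:-3]      # circular right shift by 3
shift = estimate_shift(signal, moved)  # recovers the shift, 3
```

Subpixel and rotation/scale variants refine this basic idea, but inherit the same FFT bottleneck discussed above.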
2.7 Decomposition of Large Images in Image Processing
In certain applications, namely MRI, CT, and confocal microscopy, huge 3D images need to be acquired and reconstructed. In particular, this can mean up to gigabytes of data per single image. Even with 2D images, excessively large resolutions can be reached, such as in Earth observation (satellite or panoramic views) or in pathology (images of tissues). For such data, image processing is difficult and requires advanced approaches, with respect to optimizations of memory usage. This is particularly true for GPU implementations, since graphics hardware has relatively small memory, as discussed in the Introduction.
Sometimes, the solution is to simply buy a card with more memory. For example, in [CTMT10], the algorithm is tested on an NVidia GeForce GTX 280 with 1 GB of video memory for 2D images, and on an NVidia Tesla C1060 with 4 GB of memory for larger 3D images. It can also be mentioned that the latter card achieves slightly slower performance, due to a lower memory bandwidth. However, this is not a universal solution. While the size of the memory space on graphics hardware is constantly increasing, so does the resolution of acquisition devices, introducing a "race" between image and memory sizes. Furthermore, high-end graphics hardware is not always affordable for every user. Thus, it is advisable to search for optimizations of image processing methods, in order to be able to decompose images into smaller parts.
Convolution and deconvolution are typical examples of operations where data decomposition can be performed in a straightforward but suboptimal manner. Due to convolution properties, the (de)convolved image can be divided into arbitrarily small parts, but all the subimages have to be extended with neighbouring pixels by at least the size of the filter kernel (PSF), as described in [TH78]. Thus, a lot of redundant computation needs to be performed, proportionally to the PSF size. This approach was successfully used in [BRHM96] to compute spatially variant deconvolution of Hubble Space Telescope images.
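The overlap scheme of [TH78] can be sketched in one dimension as follows. This is a minimal illustration under stated assumptions, not code from the cited work, and the helper names are invented: each chunk is extended by a halo of the PSF radius, and the redundant work grows with the halo-to-tile ratio.

```python
def halo_overhead(tile, halo):
    """Fraction of redundant work when a tile is extended by `halo`
    samples on each side (1D case)."""
    return (tile + 2 * halo) / tile - 1.0

def split_with_halo(length, tile, halo):
    """Decompose a 1D signal of `length` samples into overlapping chunks.

    Each chunk of `tile` samples is extended by `halo` samples on both
    sides (clamped at the borders), so a filter of radius `halo` can be
    applied to each chunk independently of its neighbours.
    Returns a list of (start, stop) index pairs.
    """
    chunks = []
    for start in range(0, length, tile):
        stop = min(start + tile, length)
        chunks.append((max(0, start - halo), min(length, stop + halo)))
    return chunks
```

For a tile of 4 samples and a halo of 2, each chunk doubles in size, i.e. 100 % of the work is redundant; larger tiles amortize the halo at the cost of more memory per chunk.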
In MRI and tomographic reconstructions, a division of the dataset can be used to perform the computation on multiple processing units, thereby accelerating time-consuming algorithms. In [SHT+08], a multi-GPU implementation of MRI reconstruction was tested on four NVidia Quadro graphics cards. The key part of the algorithm was the time-consuming computation of an excessively large vector; therefore, a simple data splitting could be used. A slightly sublinear speedup was achieved, due to the overhead required to arrange the data. Jang et al. [JKDP09] proposed an algorithm for iterative tomographic reconstruction on multiple GPUs, namely on an NVidia Tesla S870, which consisted of four GPUs. A detailed description of the multi-GPU utilization, however, was not provided in the paper.
Another example of decomposition can be found in [SA09]. In this paper, Sharma and Anton proposed a level set method for 3D reconstruction of large acoustic underwater images. With multibeam echo sounders, a series of 2D images is acquired, forming a 3D volume of nearly 100 million voxels. These data use about 470 MB of memory, along with the same amount of memory for storing the signed distance field. Therefore, the volume was divided into small subvolumes kept in 3D textures of 128³ voxels. The chunks of data were then processed independently, as described in the following paragraph.
First, the signed distance transform was computed. However, it could not be implemented in a straightforward manner, since it is a global operation. Instead, a local approximation of the Euclidean distance was performed, using the Chamfer distance. During every single pass, the information about the distance is propagated using a 3×3×3 neighbourhood of voxels. Hence, every subvolume needs to be supplemented with a one-voxel cover from the adjoining subvolumes, thereby reducing the computational domain to a volume of 126³ voxels. Second, average intensities were computed by means of reduction; this does not require an advanced approach to handle the subvolumes. Third, the weighted median was calculated, using an iterative method. Finally, the PDE solver was executed, using information from a 5×5×5 neighbourhood, thus reducing the computational domain further down to 124³ voxels.
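The shrinking computational domain described above follows directly from the stencil size; a minimal sketch (with illustrative names, not taken from [SA09]) makes the arithmetic explicit.

```python
def effective_domain(subvolume, stencil):
    """Side length of the region computed correctly inside a subvolume.

    Each output voxel needs a `stencil`^3 neighbourhood, so the valid
    region shrinks by the stencil radius on every side, and only that
    halo has to be shared with adjoining subvolumes.
    """
    halo = stencil // 2   # e.g. a 3x3x3 stencil needs a 1-voxel halo
    return subvolume - 2 * halo
```

For the 128³ textures above, a 3×3×3 Chamfer pass leaves a 126³ valid region and a 5×5×5 PDE stencil leaves 124³, matching the figures reported in the paper.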
Takizawa and Kobayashi [TK06] proposed a k-means algorithm on a cluster of multiple computers supported by graphics hardware. Their method was based on a three-level hierarchical approach to parallel processing. At the top, the divide-and-conquer approach was employed in two levels. In this manner, large-scale data were divided into partly parallel subtasks, each performed by an individual PC. After obtaining the data, the k-means algorithm was executed in two phases. One part, the nearest-neighbour search, was implemented on the GPU; the other one was performed on the CPU, since it involved random memory writes and conditional branches, which are inappropriate for GPU acceleration. The algorithm was evaluated on various configurations. The GPU-assisted implementation on a single PC achieved a 4× speedup over the CPU-only version. The further performance gain attained by 4 computers was approximately 3×, showing relatively good scalability on PC clusters.
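The two-phase split used by Takizawa and Kobayashi can be sketched as follows; this is a plain-Python illustration of where the phase boundary lies, not their implementation.

```python
def assign(points, centroids):
    """Phase 1 - nearest-centroid search: one independent distance
    computation per point, which is the data-parallel, GPU-friendly part."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def update(points, labels, k):
    """Phase 2 - centroid update: scattered memory writes and branching,
    the kind of work the cited approach keeps on the CPU."""
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p, label in zip(points, labels):
        counts[label] += 1
        for d in range(dims):
            sums[label][d] += p[d]
    return [[s / counts[i] for s in sums[i]] if counts[i] else sums[i]
            for i in range(k)]
```

One k-means iteration is then `update(points, assign(points, centroids), k)`; only the assignment phase maps cleanly onto a GPU kernel.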
Hartley et al. [HCR+08] designed a cooperative environment for the analysis of pathology images. The analysis included extraction of four second-order statistical features, namely contrast, correlation, homogeneity, and energy. For the purpose of texture analysis, the local binary pattern (LBP) feature was computed. Since one slice of a tissue can be up to 30 gigabytes of uncompressed data, the computation would take hours on a single CPU. Therefore, the images were decomposed into tiles of 1000×1000 pixels and processed independently and in parallel on a cluster of 16 nodes. Each node was equipped with two dual-core AMD Opteron 2218 CPUs and two NVidia Quadro FX 5600 GPUs. An additional six computers were used as reader nodes in order to provide enough disk bandwidth. The results revealed the TIFF decompression stage to be a significant bottleneck when two graphics cards per node were used. Three different image sizes were tested, "small", "middle", and "large", with resolutions of 2.1, 4.7, and 8.5 gigapixels, respectively. The total computation times for the particular image sizes were under 4, 7, and 12 seconds, respectively. For comparison, the analysis on a single CPU would take almost 3 hours for the large image size.
For the parallelization in the work above, the DataCutter middleware framework by Beynon et al. [Dat01] was used. This environment was designed to support processing of large scientific datasets stored in archival storage systems across a network. Having previously developed the Active Data Repository (ADR), Beynon et al. proposed DataCutter to extend and apply the features of ADR, in particular, to provide data processing on machines separated from storage servers. In the DataCutter architecture, clients are supported by two important services, namely indexing and filtering. The former was proposed to retrieve spatial subsets from large datasets. Since the dataset sizes could be up to petabytes, a multilevel hierarchical indexing scheme was used, based on the R-tree structure. The latter was employed to perform non-spatial subsetting and data aggregation. For experimental testing of the DataCutter framework, the Virtual Microscope application was developed, to emulate a high-power light microscope. For detailed information, refer to [BKC+01].
3 Dissertation Thesis Intent
The main intent of my dissertation thesis will be the mapping of image processing algorithms to the GPU. I will focus on the processing of large 3D images, which we deal with in the CBIA laboratory. As the output of the thesis, the methods will be implemented and included in the i3dalgo library, which is developed by our laboratory. Many parallel programming models known to date require considerably high effort for a GPU implementation [CDMS+08]. Considering the trade-off between computational speed and the time spent on software development, I decided to choose CUDA, for it provides a good framework for programmers and, generally, significant speedups of implementations. If there is enough time, OpenCL implementations will be studied as well, in order to provide a comparison between the two most popular frameworks. Since the algorithms will be applied in light microscopy, the following image processing methods, essential for this field, will be studied.
Image Restoration. Convolution and deconvolution are two related methods used for image restoration. The former is employed in CytoPacq, the online framework for simulating fluorescence microscopy images [SKS09], to imitate the blurring introduced by the optical system of a microscope. In contrast, the latter is used to reconstruct the original image from the blurred one.
Image Segmentation. Image segmentation is a fundamental part of image processing in biomedical applications. Therefore, several segmentation algorithms will be considered, based on the methods from mathematical morphology (described below). These methods can be significantly time-consuming and, in some cases, require global knowledge during the computation; therefore, they are challenging for a GPU implementation.
Morphological Processing. The implementations of basic algorithms, such as dilation, erosion, opening, and closing, have already been well studied and described in the literature. Thus, more advanced approaches will be considered, for example, the watershed transform, morphological reconstructions, and reconstruction-based operators. These algorithms are often used for image segmentation, either for preprocessing or for the segmentation itself. The well-known drawback of the watershed is over-segmentation; thus, preprocessing is an essential part of the segmentation process.
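For reference, the basic operators mentioned above reduce to neighbourhood maxima and minima; a minimal 2D sketch (illustrative only, not the i3dalgo implementation) shows the primitive that the advanced operators build on.

```python
def dilate(img, w, h):
    """Grayscale dilation of a flattened w*h image with a 3x3 structuring
    element: each pixel becomes the maximum over its neighbourhood
    (clamped at the image borders)."""
    out = []
    for y in range(h):
        for x in range(w):
            m = img[y * w + x]
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < w and 0 <= ny < h:
                        m = max(m, img[ny * w + nx])
            out.append(m)
    return out

def erode(img, w, h):
    """Erosion is the dual of dilation: the minimum over the
    neighbourhood, obtained here by negating twice."""
    return [-v for v in dilate([-v for v in img], w, h)]
```

Opening is then `dilate(erode(img, w, h), w, h)` and closing the reverse; both are embarrassingly parallel per pixel, which is why they map well onto GPUs, whereas watershed and morphological reconstruction need iterative, globally propagating passes.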
Schedule
I plan to finish my Ph.D. studies with a dissertation defence in autumn 2012 (after 8 terms of doctoral studies). The expected time schedule of my work is as follows:
Spring 2011 — an Erasmus internship in a foreign research centre. The methods from mathematical morphology will be studied. Particular algorithms will be chosen for my research topic, to be applied to the segmentation of biomedical data. Subsequently, research on the parallelization of the algorithms will be conducted.
Autumn 2011 — I will proceed with the previous work on morphological processing and image segmentation. Also, deconvolution methods will be examined. Since large filter kernels (PSFs) are typical for light microscopy, deconvolution based on the fast Fourier transform will be studied in particular.
Spring 2012 — The deconvolution algorithms will be implemented on the GPU and applied to biomedical data. Both real data from optical microscopy and artificial data from the CytoPacq simulator [SKS09] will be taken into account.
Autumn 2012 — Dissertation write-up and defence.
During my Ph.D. studies, all results of the work should be published at an international conference or in a highly rated journal.
References
[AE10] AccelerEyes – MATLAB GPU Computing. http://www.accelereyes.com, 2010.
[AMD10] AMD FireStream™ 9270 GPU Compute Accelerator. http://www.amd.com/us/products/workstation/firestream/firestream9270/pages/firestream9270.aspx, Jul 2010.
[ATI10a] ATI Stream SDK OpenCL Programming Guide. http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf, Aug 2010.
[ATI10b] ATI Stream Software Development Kit (SDK) v2.1. http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx, May 2010.
[BBBČ10] Jiří Barnat, Petr Bauch, Luboš Brim, and Milan Češka. Computing Strongly Connected Components in Parallel on CUDA. Technical Report FIMU-RS-2010-10, Faculty of Informatics, Masaryk University, Brno, 2010.
[BFH+04] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. ACM Transactions on Graphics, 23:777–786, 2004.
[BFRR01] Thomas Bräunl, Stefan Feyrer, Wolfgang Rapf, and Michael Reinhardt. Parallel Image Processing. Springer, 2001.
[BK03] Mario Botsch and Leif Kobbelt. High-Quality Point-Based Rendering on Modern GPUs. In PG '03: Proceedings of the 11th Pacific Conference on Computer Graphics and Applications, page 335, Washington, DC, USA, 2003. IEEE Computer Society.
[BKC+01] Michael D. Beynon, Tahsin Kurc, Umit Catalyurek, Chialin Chang, Alan Sussman, and Joel Saltz. Distributed processing of very large datasets with DataCutter. Parallel Comput., 27(11):1457–1478, 2001.
[BM91] S. Beucher and Centre de Morphologie Mathématique. The Watershed Transformation Applied To Image Segmentation. In Scanning Microscopy International, pages 299–314, 1991.
[BR96] Stephen Brown and Jonathan Rose. Architecture of FPGAs and CPLDs: A Tutorial. IEEE Design and Test of Computers, 13:42–57, 1996.
[BRHM96] A. F. Boden, D. C. Redding, R. J. Hanisch, and J. Mo. Massively parallel spatially variant maximum-likelihood restoration of Hubble Space Telescope imagery. J. Opt. Soc. Am. A, 13(7):1537–1545, 1996.
[BTVG06] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded Up Robust Features. In Aleš Leonardis, Horst Bischof, and Axel Pinz, editors, Computer Vision – ECCV 2006, volume 3951 of Lecture Notes in Computer Science, pages 404–417. Springer Berlin/Heidelberg, 2006. 10.1007/11744023_32.
[BZZ09] S. Bonettini, R. Zanella, and L. Zanni. A scaled gradient projection method for constrained image deblurring. Inverse Problems, 25(1):015002, 2009.
[Cas96] Kenneth R. Castleman. Digital Image Processing. Prentice Hall, 2nd edition, 1996.
[CBM+08] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. A performance study of general-purpose applications on graphics processors using CUDA. J. Parallel Distrib. Comput., 68(10):1370–1380, 2008.
[CC74] Christoph Cremer and Thomas Cremer. Considerations on a laser scanning microscope with high resolution and depth of field. 1974.
[CDMS+08] Daniel Castaño-Díez, Dominik Moser, Andreas Schoenegger, Sabine Pruggnaller, and Achilleas S. Frangakis. Performance evaluation of image processing algorithms on the GPU. Journal of Structural Biology, 164(1):153–160, 2008.
[CLW04] Joshua E. Cates, Aaron E. Lefohn, and Ross T. Whitaker. GIST: an interactive, GPU-based level set segmentation tool for 3D medical images. Medical Image Analysis, 8(3):217–231, 2004. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2003.
[Coe10] David Coeurjolly. Volumetric analysis of digital objects using distance transformation: performance issues and applications, August 2010. Workshop on Applications of Digital Geometry and Mathematical Morphology.
[CTMT10] Thanh-Tung Cao, Ke Tang, Anis Mohamed, and Tiow-Seng Tan. Parallel Banding Algorithm to compute exact distance transform with the GPU. In I3D '10: Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games, pages 83–90, New York, NY, USA, 2010. ACM.
[CUD10a] CUDA™ CUBLAS Library 3.1. http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/CUBLAS_Library_3.1.pdf, Jun 2010.
[CUD10b] CUDA™ CUFFT Library 3.1. http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/CUFFT_Library_3.1.pdf, Jun 2010.
[CUD10c] CUDA™ Programming Guide 3.1. http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf, Jun 2010.
[CUD10d] CUDA™ SDK Code Samples 3.1. http://developer.nvidia.com/object/cuda_sdk_samples.html, Jun 2010.
[CUD10e] NVIDIA GPU Computing Developer Home Page. http://developer.nvidia.com/object/gpucomputing.html, Jun 2010.
[CV01] Tony F. Chan and Luminita A. Vese. A level set algorithm for minimizing the Mumford-Shah functional in image processing. In VLSM '01: Proceedings of the IEEE Workshop on Variational and Level Set Methods (VLSM'01), page 161, Washington, DC, USA, 2001. IEEE Computer Society.
[CVG08] N. Cornelis and L. Van Gool. Fast scale invariant feature detection and matching on programmable graphics hardware. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–8, 2008.
[Dat01] DataCutter Project. http://www.cs.umd.edu/projects/hpsl/ResearchAreas/DataCutter.htm, 2001.
[DDdV10] Bernardt Duvenhage, J. P. Delport, and Jason de Villiers. Implementation of the Lucas-Kanade image registration algorithm on a GPU for 3D computational platform stabilisation. In AFRIGRAPH '10: Proceedings of the 7th International Conference on Computer Graphics, Virtual Reality, Visualisation and Interaction in Africa, pages 83–90, New York, NY, USA, 2010. ACM.
[DMAC03] F. Drago, K. Myszkowski, T. Annen, and N. Chiba. Adaptive Logarithmic Mapping For Displaying High Contrast Scenes. Computer Graphics Forum, 22:419–426, 2003.
[DPRS01] Udo Diewald, Tobias Preusser, Martin Rumpf, and Robert Strzodka. Diffusion Models And Their Accelerated Solution In Image And Surface Processing. Acta Math. Univ. Comenian., 70:15–34, 2001.
[DS04] Carsten Dachsbacher and Marc Stamminger. Rendering Procedural Terrain by Geometry Image Warping. In Rendering Techniques, pages 103–110, 2004.
[Duc76] J. Duchon. Interpolation des fonctions de deux variables suivant le principe de la flexion des plaques minces. R.A.I.R.O. Analyse numérique, 10:5–12, 1976.
[DVW09] L. Domanski, P. Vallotton, and D. Wang. Two and Three-Dimensional Image Deconvolution on Graphics Hardware. In Proceedings of the 18th World IMACS/MODSIM Congress, pages 1010–1016, Cairns, Australia, 13–17 July 2009.
[ESA05] O. C. Eidheim, J. Skjermo, and L. Aurdal. Real-time analysis of ultrasound images using GPU. International Congress Series, 1281:284–289, 2005. CARS 2005: Computer Assisted Radiology and Surgery.
[FEK+05] Quanfu Fan, Alon Efrat, Vladlen Koltun, Shankar Krishnan, and Suresh Venkatasubramanian. Hardware-assisted natural neighbor interpolation. In Proc. 7th Workshop on Algorithm Engineering and Experiments (ALENEX), 2005.
[FJ09] Matteo Frigo and Steven G. Johnson. FFTW 3.2.2. http://www.fftw.org/fftw3.pdf, Jul 2009.
[FM05] James Fung and Steve Mann. OpenVIDIA: parallel GPU computer vision. In MULTIMEDIA '05: Proceedings of the 13th annual ACM international conference on Multimedia, pages 849–852, New York, NY, USA, 2005. ACM.
[FR] James Fung and John Roberts. OpenVIDIA: Parallel GPU Computer Vision. http://openvidia.sourceforge.net/index.php/OpenVIDIA.
[Fra04] François Goudail and Philippe Réfrégier. Statistical image processing techniques for noisy images: an application-oriented approach. Springer, 2004.
[Fun05] J. Fung. Computer Vision on the GPU, chapter 40, pages 649–665. Addison-Wesley, 2005.
[FČ06] O. Fialka and M. Čadík. FFT and Convolution Performance in Image Filtering on GPU. In Information Visualization, 2006. IV 2006. Tenth International Conference on, pages 609–614, 2006.
[GA09] Leo Grady and Christopher V. Alvino. The piecewise smooth Mumford-Shah functional on an arbitrary graph. Trans. Img. Proc., 18(11):2547–2561, 2009.
[GLD+08] Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. High performance discrete Fourier transforms on graphics processors. In SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press.
[GLG05] Minglun Gong, Aaron Langille, and Mingwei Gong. Image Analysis and Recognition, volume 3656/2005, chapter Real-Time Image Processing Using Graphics Hardware: A Performance Study, pages 1217–1225. Springer Berlin/Heidelberg, 2005.
[GLGN+08] Michael Garland, Scott Le Grand, John Nickolls, Joshua Anderson, Jim Hardwick, Scott Morton, Everett Phillips, Yao Zhang, and Vasily Volkov. Parallel Computing Experiences with CUDA. IEEE Micro, 28(4):13–27, 2008.
[GMH+09] Markus Gipp, Guillermo Marcus, Nathalie Harder, Apichat Suratanee, Karl Rohr, Rainer König, and Reinhard Männer. Haralick's Texture Features Computed by GPUs for Biological Applications. 2009.
[GPG] General-Purpose computation on Graphics Processing Units. http://www.gpgpu.org.
[Gri10] Future of computing: GPGPU? http://gridtalkproject.blogspot.com/2010/07/futureofcomputinggpgpu.html, Jul 2010.
[GS05] Antonio Garcia and Han-Wei Shen. GPU-based 3D wavelet reconstruction with tile-boarding. The Visual Computer, 21:755–763, 2005. 10.1007/s00371-005-0332-0.
[GW02] Rafael C. Gonzales and Richard E. Woods. Digital Image Processing. Prentice-Hall, 2nd edition, 2002.
[GWWH05] Nolan Goodnight, Rui Wang, Cliff Woolley, and Greg Humphreys. Interactive time-dependent tone mapping using programmable graphics hardware. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 180, New York, NY, USA, 2005. ACM.
[Hal08] J. P. Haldar, D. Hernando, S. K. Song, and Z.-P. Liang. Anatomically constrained reconstruction from noisy data. Magnetic Resonance in Medicine, 59:810–818, 2008.
[HB08] S. Harding and W. Banzhaf. Genetic programming on GPUs for image processing. Int. J. High Perform. Syst. Archit., 1(4):231–240, 2008.
[HCR+08] Timothy D. R. Hartley, Umit Catalyurek, Antonio Ruiz, Francisco Igual, Rafael Mayo, and Manuel Ujaldon. Biomedical image analysis on a cooperative cluster of GPUs and multicores. In ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, pages 15–25, New York, NY, USA, 2008. ACM.
[HE00] Matthias Hopf and Thomas Ertl. Hardware Accelerated Wavelet Transformations. In Proceedings of EG/IEEE TCVG Symposium on Visualization VisSym '00, pages 93–103, 2000.
[HLSB04] Markus Hadwiger, Caroline Langer, Henning Scharsach, and Katja Bühler. State of the Art Report 2004 on GPU-Based Segmentation. Technical report, VRVis Research Center, Vienna, 2004.
[HN07] Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. In HiPC '07: Proceedings of the 14th international conference on High performance computing, pages 197–208, Berlin, Heidelberg, 2007. Springer-Verlag.
[How10] Lee Howes. OpenCL: Parallel Computing for CPUs and GPUs. http://developer.amd.com/gpu_assets/OpenCL_Parallel_Computing_for_CPUs_and_GPUs_201003.pdf, Mar 2010.
[HS92] Stefan Hell and Ernst H. K. Stelzer. Properties of a 4Pi confocal fluorescence microscope. J. Opt. Soc. Am. A, 9(12):2159–2166, 1992.
[Jäh05] Bernd Jähne. Digital Image Processing. Springer, 6th edition, 2005.
[JKDP09] Byunghyun Jang, David Kaeli, Synho Do, and Homer Pien. Multi-GPU implementation of iterative tomographic reconstruction algorithms. In ISBI '09: Proceedings of the Sixth IEEE international conference on Symposium on Biomedical Imaging, pages 185–188, Piscataway, NJ, USA, 2009. IEEE Press.
[Joh07] John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo G. Trabuco, and Klaus Schulten. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, 28(16):2618–2640, 2007.
[KDY+10] Jingfei Kong, Martin Dimitrov, Yi Yang, Janaka Liyanage, Lin Cao, Jacob Staples, Mike Mantor, and Huiyang Zhou. Accelerating MATLAB Image Processing Toolbox functions on GPUs. In GPGPU '10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 75–85, New York, NY, USA, 2010. ACM.
[KEGM10] Dimitri Komatitsch, Gordon Erlebacher, Dominik Göddeke, and David Michéa. High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster. Journal of Computational Physics, in press, corrected proof, 2010.
[KES07] Martin Kraus, Mike Eissele, and Magnus Strengert. GPU-Based Edge-Directed Image Interpolation. In Bjarne Ersbøll and Kim Pedersen, editors, Image Analysis, volume 4522 of Lecture Notes in Computer Science, pages 532–541. Springer Berlin/Heidelberg, 2007. 10.1007/978-3-540-73040-8_54.
[Kim05] Ron Kimmel, Nir Sochen, and Joachim Weickert, editors. Scale Space and PDE Methods in Computer Vision. Springer, 2005. Proceedings of the 5th International Conference on Scale Space and PDE Methods in Computer Vision, Scale-Space 2005.
[KK04] Francis Kelly and Anil Kokaram. Fast Image Interpolation for Motion Estimation using Graphics Hardware. In Real Time Imaging VIII, SPIE, pages 184–194, 2004.
[KmH10] David Kirk and Wen-mei Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
[KP08] C. Kauffmann and N. Piche. Cellular automaton for ultrafast watershed transform on GPU. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4, 2008.
[KW03] Jens Krüger and Rüdiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In SIGGRAPH '03: ACM SIGGRAPH 2003 Papers, pages 908–916, New York, NY, USA, 2003. ACM.
[LCW03] Aaron Lefohn, Joshua Cates, and Ross Whitaker. Interactive, GPU-Based Level Sets for 3D Segmentation. pages 564–572, 2003.
[LD08] Yuancheng Luo and R. Duraiswami. Canny edge detection on NVIDIA CUDA. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–8, 2008.
[LDS05] David Levin, Damini Dey, and Piotr Slomka. Efficient 3D nonlinear warping of computed tomography: two high-performance implementations using OpenGL. volume 5744, pages 34–42. SPIE, 2005.
[LDZZ09] Baofeng Li, Yong Dou, Haifang Zhou, and Xingming Zhou. FPGA accelerator for wavelet-based automated global image registration. EURASIP J. Embedded Syst., 2009:1–10, 2009.
[LKC+10] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Comput. Archit. News, 38(3):451–460, 2010.
[Luc74] L. B. Lucy. An iterative technique for the rectification of observed distributions. The Astronomical Journal, 79:745, June 1974.
[Mar96] Petros A. Maragos. Mathematical Morphology and Its Applications to Image and Signal Processing. Kluwer Academic Publishers, Norwell, MA, USA, 1996.
[MBS09] Maryam Moazeni, Alex Bui, and Majid Sarrafzadeh. Accelerating total variation regularization for matrix-valued images on GPUs. In CF '09: Proceedings of the 6th ACM conference on Computing frontiers, pages 137–146, New York, NY, USA, 2009. ACM.
[MDTP+04] Michael McCool, Stefanus Du Toit, Tiberiu Popa, Bryan Chan, and Kevin Moule. Shader algebra. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, pages 787–795, New York, NY, USA, 2004. ACM.
[MOOXS08] P. Muyan-Ozcelik, J. D. Owens, Junyi Xia, and S. S. Samant. Fast Deformable Registration on the GPU: A CUDA Implementation of Demons. In Computational Sciences and Its Applications, 2008. ICCSA '08. International Conference on, pages 223–233, June 2008.
[NOEM08] Akira Nukada, Yasuhiko Ogata, Toshio Endo, and Satoshi Matsuoka. Bandwidth intensive 3D FFT kernel for GPUs using CUDA. In SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1–11, Piscataway, NJ, USA, 2008. IEEE Press.
[Nyl07] Lars Nyland, Mark Harris, and Jan Prins. Fast N-Body Simulation with CUDA. In Hubert Nguyen, editor, GPU Gems 3, chapter 31. Addison-Wesley Professional, August 2007.
[OLG+05] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. pages 21–51, August 2005.
[OLG+07] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, and Timothy J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007.
[Ope10a] OpenCL. http://www.khronos.org/opencl/, Jun 2010.
[Ope10b] OpenCL 1.1 Reference Pages. http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/, Jun 2010.
[Ope10c] The OpenCL Specification 1.1. http://developer.amd.com/gpu/ATIStreamSDK/assets/opencl1.1rev33.pdf, Jun 2010.
[PDHW10] Brad Peterson, Manasi Datar, Mary Hall, and Ross Whitaker. GPU Accelerated Particle System for Triangulated Surface Meshes. Jul 2010.
[PGS+09] Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, and Wen-Mei W. Hwu. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. Application Specific Processors, Symposium on, 0:35–42, 2009.
[PGX08] Lei Pan, Lixu Gu, and Jianrong Xu. Implementation of medical image segmentation in CUDA. In Information Technology and Applications in Biomedicine, 2008. ITAB 2008. International Conference on, pages 82–85, 2008.
[Pod07] Victor Podlozhnyuk. Image Convolution with CUDA. http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/convolutionSeparable/doc/convolutionSeparable.pdf, Jun 2007.
[Pra01] William K. Pratt. Digital Image Processing. John Wiley & Sons, 3rd edition, 2001.
[PSL+10] In Kyu Park, Nitin Singhal, Man Hee Lee, Sungdae Cho, and Chris Kim. Design and Performance Evaluation of Image Processing Algorithms on GPUs. IEEE Transactions on Parallel and Distributed Systems, 99(PrePrints), 2010.
[QFTI09] Cory W. Quammen, David Feng, and Russell M. Taylor II. Performance of 3D Deconvolution Algorithms on Multi-Core and Many-Core Architectures. Technical report, University of North Carolina at Chapel Hill, Department of Computer Science, 2009.
[RIC72] William Hadley Richardson. Bayesian-Based Iterative Method of Image Restoration. J. Opt. Soc. Am., 62(1):55–59, 1972.
[RM07] Matthias Raspe and Stefan Müller. Using a GPU-based Framework for Interactive Tone Mapping of Medical Volume Data. In SIGRAD 2007: Computer Graphics in Healthcare, pages 3–10. Linköping University Electronic Press, Linköpings universitet, 2007.
[ROF92] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Phys. D, 60(1–4):259–268, 1992.
[RPZ02] Liu Ren, Hanspeter Pfister, and Matthias Zwicker. Object Space EWA Surface Splatting: A Hardware Accelerated Approach to High Quality Point Rendering. In Computer Graphics Forum, pages 461–470, 2002.
[RRB+08] Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, pages 73–82, New York, NY, USA, 2008. ACM.
[RSSF02] Erik Reinhard, Michael Stark, Peter Shirley, and James Ferwerda. Photographic tone reproduction for digital images. ACM Trans. Graph., 21(3):267–276, 2002.
[RT06] Guodong Rong and Tiow-Seng Tan. Jump flooding in GPU with applications to Voronoi diagram and distance transform. In I3D '06: Proceedings of the 2006 symposium on Interactive 3D graphics and games, pages 109–116, New York, NY, USA, 2006. ACM.
[RUCH09] Antonio Ruiz, Manuel Ujaldon, Lee Cooper, and Kun Huang. Nonrigid Registration for Large Sets of Microscopic Images on Graphics Processors. J. Signal Process. Syst., 55(1–3):229–250, 2009.
[RUG07] Antonio Ruiz, Manuel Ujaldón, and Nicolás Guil. Using Graphics Hardware for Enhancing Edge and Circle Detection. In Joan Martí, José Benedí, Ana Mendonça, and Joan Serrat, editors, Pattern Recognition and Image Analysis, volume 4478 of Lecture Notes in Computer Science, pages 234–241. Springer Berlin/Heidelberg, 2007. 10.1007/978-3-540-72849-8_30.
[SA09] Ojaswa Sharma and François Anton. CUDA based Level Set Method for 3D Reconstruction of Fishes from Large Acoustic Data. In International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), 2009.
[SBW06] Jens Schneider, Tobias Boldte, and Rüdiger Westermann. Real-Time Editing, Synthesis, and Rendering of Infinite Landscapes on GPUs. In Vision, Modeling and Visualization 2006, 2006.
[SDR03] Robert Strzodka, Marc Droske, and Martin Rumpf. Fast Image Registration in DX9 Graphics Hardware. Journal of Medical Informatics and Technologies, 6:43–49, Nov 2003.
[SDR04] Robert Strzodka, Marc Droske, and Martin Rumpf. Image Registration by a Regularized Gradient Flow – A Streaming Implementation in DX9 Graphics Hardware. Computing, 73(4):373–389, November 2004.
[Ser83] Jean Serra. Image Analysis and Mathematical Morphology. Academic Press, Inc., Orlando, FL, USA, 1983.
[SG04] Robert Strzodka and Christoph Garbe. Real-Time Motion Estimation and Visualization on Graphics Cards. In Proceedings IEEE Visualization 2004, pages 545–552, 2004.
[SHN03] Anthony Sherbondy, Mike Houston, and Sandy Napel. Fast Volume Segmentation With Simultaneous Visualization Using Programmable Graphics Hardware. In VIS '03: Proceedings of the 14th IEEE Visualization 2003 (VIS'03), page 23, Washington, DC, USA, 2003. IEEE Computer Society.
[SHT+08] S. S. Stone, J. P. Haldar, S. C. Tsao, W.-m. W. Hwu, B. P. Sutton, and Z.-P. Liang. Accelerating advanced MRI reconstructions on GPUs. J. Parallel Distrib. Comput., 68(10):1307–1318, 2008.
[SKE06] Magnus Strengert, Martin Kraus, and Thomas Ertl. Pyramid Methods in GPU-Based Image Processing. pages 169–176, 2006.
[SKS09] David Svoboda, Michal Kozubek, and Stanislav Stejskal. Generation of Digital Phantoms of Cell Nuclei and Simulation of Image Formation in 3D Image Cytometry. Cytometry Part A, 75A(6):494–509, Jun 2009.
[SKSF07] G. C. Sharp, N. Kandasamy, H. Singh, and M. Folkert. GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration. Physics in Medicine and Biology, 52(19):5771, 2007.
[SM06] Jean-Luc Starck and Fionn Murtagh. Astronomical Image and Data Analysis. Springer, 2006.
[SM08] H. Sugano and R. Miyamoto. Parallel implementation of morphological processing on Cell/BE with OpenCV interface. In Communications, Control and Signal Processing, 2008. ISCCSP 2008. 3rd International Symposium on, pages 578–583, 2008.
[SN06] P. Sarder and A. Nehorai. Deconvolution methods for 3D fluorescence microscopy images. Signal Processing Magazine, IEEE, 23(3):32–45, May 2006.
38
[SV82] L.A.Shepp and Y.Vardi.MaximumLikelihood Reconstruction for Emis
sion Tomography.Medical Imaging,IEEE Transactions on,1(2):113–122,
1982.
[SZZ10] Thomas Seraﬁni,Riccardo Zanella,and Luca Zanni.Gradient projection
methods for image deblurring and denoising on graphics processors.
Advances in Parallel Computing,19:59–66,2010.
[Tes09] Tesla C1060 Computing Processor Board.http://www.nvidia.com/
docs/IO/43395/BD04111001_v05.pdf,Jan 2009.
[Tes10] Tesla C2050/C2070 GPU Computing Processor.http://www.nvidia.
com/object/product_tesla_C2050_C2070_us.html,Jul 2010.
[TH78] H.Trussell and B.Hunt.Image restoration of space variant blurs by
sectioned methods.In Acoustics,Speech,and Signal Processing,IEEE Inter
national Conference on ICASSP ’78.,volume 3,pages 196 – 198,apr 1978.
[TK06] Hiroyuki Takizawa and Hiroaki Kobayashi.Hierarchical parallel process
ing of large scale data clustering on a PCcluster with GPUcoprocessing.
J.Supercomput.,36(3):219–234,2006.
[Uja09] Manuel Ujaldon.GPU acceleration of Zernike moments for largescale
images.In IPDPS ’09:Proceedings of the 2009 IEEE International Symposium
on Parallel&Distributed Processing,pages 1–8,Washington,DC,USA,2009.
IEEE Computer Society.
[vDvdWT06] M.van Dortmont,H.van de Wetering,and A.Telea.Skeletonization and
Distance Transforms of 3D Volumes Using Graphics Hardware.In Attila
Kuba,László Nyúl,and Kálmán Palágyi,editors,Discrete Geometry for
Computer Imagery,volume 4245 of Lecture Notes in Computer Science,pages
617–629.Springer Berlin/Heidelberg,2006.10.1007/11907350_52.
[Ver98] Peter J.Verveer.Computational and optical methods for improving resolu
tion and signal quality in ﬂuorescence microscopy.PhD thesis,Technische
Universiteit Te Delft,1998.
[VN08] Vibhav Vineet and P.J.Narayanan.CUDA cuts:Fast graph cuts on the
GPU.Computer Vision and Pattern Recognition Workshop,0:1–8,June 2008.
[Wie64] Norbert Wiener.Extrapolation,Interpolation,and Smoothing of Stationary
Time Series.The MIT Press,1964.
[WLHW07] TienTsin Wong,ChiSing Leung,PhengAnn Heng,and Jianqing Wang.
Discrete Wavelet Transform on ConsumerLevel Graphics Hardware.
Multimedia,IEEE Transactions on,9(3):668 –673,april 2007.
[WN] Piotr Wendykier and James G.Nagy.Parallel Colt:A High Performance
Java Library for Scientiﬁc Computing and Image Processing.ACMTrans
actions on Mathematical Software,37(3).
39
[XCZ09] Yongshun Xiao,Zhiqiang Chen,and Li Zhang.Accelerated CT Recon
struction Using GPU SIMD Parallel Computing with Bilinear Warping
Method.Information Science and Engineering,International Conference on,
0:95–98,2009.
[XM05] Fang Xu and K.Mueller.Accelerating popular tomographic reconstruc
tion algorithms on commodity PC graphics hardware.Nuclear Science,
IEEE Transactions on,52(3):654 – 663,june 2005.
[XM09] Wei Xu and Klaus Mueller.Accelerating regularized iterative CT recon
struction on commodity graphics hardware (GPU).In ISBI’09:Proceed
ings of the Sixth IEEE international conference on Symposium on Biomedical
Imaging,pages 1287–1290,Piscataway,NJ,USA,2009.IEEE Press.
[YNAS09] Lee Seng Yeong,Christopher Wing Hong Ngau,LiMinn Ang,and
Kah Phooi Seng.Efﬁcient processing of a rainfall simulation watershed
on an FPGAbased architecture with fast access to neighbourhood pixels.
EURASIP J.Embedded Syst.,2009:4–4,2009.
[YQ08] Lei Yu and Dawei Qi.Improved multiscale and structuring element
morphological detection in the log CT image.In Cybernetics and Intelligent
Systems,2008 IEEE Conference on,pages 1248 –1253,2124 2008.
[Zit03] Barbora Zitová.Image registration methods:a survey.Image and Vision
Computing,21(11):977–1000,October 2003.
[ZKU
+
04] C.Lawrence Zitnick,Sing Bing Kang,Matthew Uyttendaele,Simon
Winder,and Richard Szeliski.Highquality video view interpolation us
ing a layered representation.ACMTrans.Graph.,23(3):600–608,2004.
[ZLB
+
09] Jian Zhu,Youquan Liu,Kai Bao,Yuanzhang Chang,and Enhua Wu.
Warping of a spherical representation of imagebased models on GPU.In
VRCAI ’09:Proceedings of the 8th International Conference on Virtual Reality
Continuum and its Applications in Industry,pages 89–94,New York,NY,
USA,2009.ACM.
40
4 Summary of Study Results
During the first three terms of doctoral study I successfully passed courses focused on
image processing methods, namely Advanced Methods of Digital Image Processing
(FI:PA166), Digital Image Filtering (FI:PA171), Mathematical Morphology (FI:PA173),
and Digital Geometry (FI:PA170). I learned a number of methods from mathematical
morphology, image filtering, edge detection, and image segmentation.
As a part of my study, I attended the Image Acquisition and Analysis in Microscopy
workshop in Prague in May 2009. I also attended the Spring School of Image Processing
in Mariánská in June 2010.
I read papers presenting the results of past research in the area of GPU acceleration,
focusing in particular on implementations of image processing algorithms. I also learned
the CUDA parallel programming model and successfully passed the GPU Programming
(FI:PV197) course.
I implemented several algorithms in CUDA, namely convolution, morphological recon-
struction, and a number of reconstruction-based operators (for a detailed description
refer to the following paragraphs). At CBIA FI MU, we develop i3dcore and i3dalgo,
cross-platform C++ libraries for 3D image processing. These libraries are now being
extended to support GPU-based processing, and the image processing algorithms
implemented in CUDA will soon be included in the libraries.
Convolution
I implemented convolution of large 3D images on the GPU, based on the fast Fourier
transform, using the CUDA framework and the CUFFT library. With D. Svoboda we
proposed a method to decompose large 3D images using the decimation in frequency
(DIF) algorithm, so that we are able to process images that are too large to be stored
in the GPU memory. A paper on the topic was submitted to the EURASIP Journal on
Advances in Signal Processing [KS10].
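The convolution theorem underlying this approach can be illustrated with a small
CPU-only sketch (using NumPy in place of CUFFT, purely for illustration; the function
name and array sizes are illustrative, and the decomposition step for oversized images
is omitted):

```python
import numpy as np

def fft_convolve(image, kernel):
    """Full linear convolution of two nD arrays via the FFT.

    Both inputs are zero-padded to the full output size, so the
    circular wrap-around of the DFT does not corrupt the result.
    """
    full_shape = [s + k - 1 for s, k in zip(image.shape, kernel.shape)]
    f_image = np.fft.rfftn(image, full_shape)
    f_kernel = np.fft.rfftn(kernel, full_shape)
    # Pointwise product in the frequency domain == convolution in space
    return np.fft.irfftn(f_image * f_kernel, full_shape)

# A unit impulse convolved with an averaging kernel reproduces the kernel
img = np.zeros((8, 8, 8))
img[4, 4, 4] = 1.0
ker = np.ones((3, 3, 3)) / 27.0
res = fft_convolve(img, ker)
assert np.allclose(res[4:7, 4:7, 4:7], ker)
```

On the GPU the same three steps map to CUFFT plan execution and a pointwise
multiplication kernel; the DIF-based decomposition splits the frequency-domain work
so that each part fits in GPU memory.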
Morphological reconstruction
I studied efficient algorithms for computing the morphological reconstruction de-
scribed in the literature. I proposed a GPU-based implementation in CUDA and com-
pared the performance of all approaches. A paper on the topic was submitted to the
MEMICS 2010 conference [K10].
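As a reference point, the textbook iterative definition of greyscale reconstruction by
dilation can be sketched in a few lines (a naive CPU sketch with NumPy/SciPy, not one
of the efficient algorithms compared above; names and the 2D test data are illustrative):

```python
import numpy as np
from scipy import ndimage

def greyscale_reconstruction(marker, mask):
    """Morphological greyscale reconstruction by dilation.

    Iterates R <- min(dilate(R), mask), starting from the marker,
    until a fixed point is reached.
    """
    current = np.minimum(marker, mask)
    footprint = np.ones((3, 3))  # flat 8-connected structuring element
    while True:
        dilated = ndimage.grey_dilation(current, footprint=footprint)
        clipped = np.minimum(dilated, mask)
        if np.array_equal(clipped, current):
            return current
        current = clipped

# Two plateaus in the mask; the marker touches only the left one,
# so reconstruction recovers the left plateau and suppresses the right.
mask = np.zeros((5, 7))
mask[1:4, 1:3] = 4.0
mask[1:4, 4:6] = 2.0
marker = np.zeros((5, 7))
marker[2, 1] = 4.0
result = greyscale_reconstruction(marker, mask)
assert np.array_equal(result[1:4, 1:3], np.full((3, 2), 4.0))
```

The efficient algorithms from the literature avoid this fixed-point iteration (e.g. by
queue- or scan-based propagation), which is what makes the performance comparison
interesting on the GPU.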
References to the submitted papers
[KS10] Pavel Karas and David Svoboda. Convolution of Large 3D Images on GPU and
its Decomposition. EURASIP Journal on Advances in Signal Processing, Hindawi,
submitted.
[K10] Pavel Karas. Efficient Computation of Morphological Greyscale Reconstruction.
MEMICS 2010 Annual Doctoral Workshop on Mathematical and Engineering
Methods in Computer Science, submitted.
There is a theory which states that if ever anyone discovers exactly
what the Universe is for and why it is here, it will instantly disappear
and be replaced by something even more bizarre and inexplicable.
There is another theory which states that this has already happened.
Douglas Adams, The Restaurant at the End of the Universe