Faculty of Informatics
Faculty of Informatics
Automatic Bug-ﬁnding Techniques for
Dissertation Thesis Topic
GPU Acceleration of
Image Processing Algorithms
Dissertation Thesis Topic
Centre for Biomedical Image Analysis
I would like to thank supervisors,both Doc.Kozubek and Dr.Svoboda for their
patience and their suggestions.I thank my colleagues for their ideas and inspiration.
My thanks belong also to my friends and parents for their endless support.
1 Introduction 1
2 Report on Current Results 5
2.1 Parallel Computing Architectures and Models................5
2.1.1 NVidia CUDA..............................5
2.1.2 ATI Stream................................7
2.1.4 Other architectures............................7
2.2 General-Purpose Computation on GPU....................9
2.3 Image Processing Acceleration.........................12
2.4 Image Formation.................................13
2.5 Image Enhancement...............................15
2.5.1 Tone Mapping...............................15
2.5.2 Image Restoration............................16
2.5.3 Geometrical Methods..........................17
2.6 Image Analysis..................................18
2.6.1 Morphological Processing........................18
2.6.2 Edge Detection..............................19
2.6.3 Image Segmentation...........................20
2.6.4 Feature Extraction............................21
2.6.5 Shape Analysis..............................23
2.6.6 Image Registration............................23
2.7 Decomposition of Large Images in Image Processing............24
3 Dissertation Thesis Intent 27
4 Summary of Study Results 41
Image processing is a rapidly developing ﬁeld based both on mathematics and com-
puter science.It is naturally connected to and strongly inﬂuenced by image acqui-
sition.Community of users is wide and includes not only common customers with
digital cameras,but also astronoms analysing data from telescopes and spacecrafts
[GW02] [SM06],specialists dealing with magnetic resonance imaging (MRI) or com-
puted tomography (CT) [Cas96] [GW02] [Jäh05],and biologists working with optical
microscopes [Ver98] [Jäh05].The users’ demands for image quality,in terms of res-
olution,signal-to-noise ratio or dynamic range,are constantly growing.As a result,
both the image acquisition devices and image processing algorithms are developing in
order to satisfy the users’ needs.
On the other hand,the complex image processing methods can be very time-
consuming and their computation can be challenging even for recent computer hard-
ware.Therefore,an integral part of the ﬁeld is optimization,so that the complex
algorithms could be performed in a reasonable time.For example,if a patient under-
goes an magnetic resonance or ultrasound investigation,it is essential to have the data
processed in several minutes so that a specialist can consult the results with the patient
and propose next steps within a single visit [GLGN
08].In other applications like
user-assisted image segmentation,the data need to be processed real-time [HLSB04].
The most important computational hardware for image processing is Central Pro-
cessing Units (CPU),for their versatility and tremendous speed.According to the
oft-quated Moore’s Law,the CPU’s computing capability is doubled every two years.
However,this trend is now inhibited as the semiconductor industry has reached phys-
ical limits in production.Soon,it will be no longer possible to decrease the size of
transistors,and therefore to increase the clock frequency.Thus,manufacturers seek for
a different ways to enhance the CPU speed.The most signiﬁcant trend is to increase
CPU’s ability to process more and more tasks in parallel,by extending the instruction
pipeline,involving vector processing units,and increasing the number of CPU cores.
In contrast to CPU,Graphics Processing Units (GPU) are recently considered to be
very efﬁcient parallel processing units.While their original purpose was to provide
a graphic output to users,their computing capabilities have been getting more and
more complex and their performance has overcome common CPUs,in terms of both
computing speed (in ﬂoating point operations per second) and memory bandwidth
10] (Figure 1).Besides,their perfor-
mance growth is not limited by current manufacturing technology,so the gulf between
CPU and GPU computational power is still getting wider.The GPU architecture bene-
ﬁt from a massive ﬁne-grained parallelization,as they are able to execute as many as
thousands of threads concurrently.Attempts to utilize GPUs not only for manipulat-
ing computer graphics,but also for different purposes,lead to a new research ﬁeld,
called “General-purpose computation on GPU (GPGPU)”.The graphic cards provide
huge computational performance for a good price.For instance,NVidia GeForce GTX
480,a top desktop graphics card with 480 streaming processors,built on a newFERMI
architecture,is slightly cheaper than the quad-core CPU Intel Core i7-960 (as to July
2010).Furthermore,the recent computer mainboards support up to 4 PCIe x16 slots
Figure 1:A performance gap between GPUs and CPUs [KmH10].
offering space for 4 GPU cards.With multiple slots occupied,a common affordable
single PC can be turned into a supercomputer,literally.
This approach also shows some caveats.First,there used to be no high-level pro-
gramming languages speciﬁcally designed for general computation purposes.The
programmers instead had to use shading languages such as Cg,High Level Shad-
ing Language (HLSL) or OpenGL Shading Language (GLSL) [OLG
08],to utilize texture units.This no more poses an obstacle.In 2004,the Brook
04] and Sh [MDTP
04] programming languages were released as an extension
of ANSI C with concepts from stream programming.In 2007,NVidia released the
ﬁrst version of Compute Uniﬁed Device Architecture (CUDA) [CUD10e] [CUD10c],a
programming model to code algorithms for execution on the GPU.A more general,
non-proprietary framework,developped by Khronos Group,is Open Computing Lan-
guage (OpenCL),which can be used for heterogenous platforms,including both the
CPU and the GPU [Ope10c] [Ope10b].
Second,not all methods are suitable for a massive parallel processing.The GPU
performance is limited by several conditions:a) the ratio of parallel to sequential
fraction of the algorithm,b) the ratio of ﬂoating point operations to global memory
accesses,c) branching diversity,d) global synchronization requirements and e) data
transfer overhead [CDMS
10].Image processing algorithms in general are
good candidates for GPU implementation,since the parallelization is naturally pro-
vided by per-pixel (or per-voxel) operations.Therefore,they naturally fulﬁll the a)
condition,particularly.This was conﬁrmed by many studies examining GPU accelera-
tion of image processing algorithms.Nevertheless,programmers have to keep in mind
Figure 2:GPU performance in single and double precision [Gri10].
also the other conditions in order to keep their implementations efﬁcient.An enor-
mous effort has been recently spent on accelerating image processing on GPU.Many
examples of successful implementations are cited in the following section of the text.
Third,programmers should be aware that graphic cards perform best in a single
ﬂoating point precission.The performance in a double and higher precissions is signif-
icantly worse.For instance,AMD/ATI GPUs provide approximately 5 times smaller
computational power in a double precision compared to a single precision [AMD10],
NVidia GPUs provide even 16 times smaller power [CUD10e] [Tes09].The latter should
change with a release of TESLAcards built on a recent FERMI architecture [Tes10] (Fig-
ure 2).In many applications,this does not pose an obstacle,since a single precision
can be enough,as will be shown in the following section.In some cases,a computation
can be divided into two parts,of which the critical one (in means of precision) can be
executed on CPU,whereas the other one is executed on GPU.In the other cases,a
careful attention needs to be paid to precision,and the users should probably wait for
better GPU architectures.
Fourth,one of the bottlenecks of graphic cards is their relatively expensive,there-
fore small global memory.A common modern GPU provides approximately 4–8 times
smaller global memory than a common CPU.This can pose signiﬁcant problems for
programmers in their effort to implement algorithms on GPU.The amount of pro-
cessed data can be enormous,especially in applications which produce three- or more
dimensional images,such as optical microscopy or computed tomography.Another
example is magnetic resonance imaging,using diffusion tensor images.In this case,
the image domain is relatively small,but the amount of information stored for every
single image element is huge.As a result,the size of the image can exceed the size of
GPU global memory,or,at best,there is enough global memory to allocate the space
for the image itself,but not for auxilliary data structures.
The simple fact,that image size exceeds the amount of available memory,does not
itself pose an obstacle for a potential GPU implementation.There is a large class of
image processing algorithms which can be referred to as methods with local knowledge.
In this case,in order to process each image element only local information about the el-
ement and its closest neighbours is needed.Image tonality adjustation,edge detection
ﬁlters,and linear combination of multiple images are typical examples of this class.
Linear and non-linear ﬁlters with relatively small kernels,as well as dilation and ero-
sion with small structural elements can be also considered members of this class.Due
to nature of local algorithms it is easy to split data into small chunks.Consequently,
these parts can be processed independently,so only little effort has to be made on
problem decomposition,actually.As will be shown in our survey,this is the case of
the most current GPU implementations dealing with huge images.
On the contrary,algorithms,which we refer to as methods with global knowledge,
need to store the entire image.Fourier transform (and fast Fourier transform as well),
Hough transformor wavelet transformare good examples of this class.Many methods,
for instance,convolution and deconvolution with large kernels,are derived fromthese
transforms,and therefore should be also considered global.The Euclidean distance
transform,the watershed algorithm,and generally all segmentation methods based on
a functional minimization are other candidates.In this case,it is necessary to seek for
nontrivial methods to decompose a problem into several parts which are independent
fromeach other,in order to process data on architectures with a small memory.
In this decomposition process,it is important to pay attention to task dependency.
The most desired goal is to split data into independent parts,so that they can be
processed in parallel,in our case,on multiple GPU cores.Also,it may be possible to
deal with huge data that cannot be stored not only in GPU memory but also in CPU
memory.Without decomposition,the computation would have to be performed in a
larger,but much slower external memory.
Intention of Thesis.The main goal of this dissertation thesis should be acceleration
of image processing algorithms with respect to huge data.Since our laboratory,Centre
for Biomedical Image Analysis (CBIA),is focused on optical microscopy,we deal with
large 3-D images;therefore,it is necessary to have tools for processing such data.The
thesis should focus on decomposition of image processing algorithms,especially of
the class of global methods.For implementation we choose the GPU platform,since
it is suitable not only for testing,but also for considerable speedup of computational-
intensive algorithms.Although many papers concerning GPU acceleration were pub-
lished,very few of them dealt with huge images and problem decomposition;there-
fore,lot of questions remain unanswered in this ﬁeld.
2 Report on Current Results
In this section we will give a summary of state of the art in GPU acceleration of im-
age processing algorithms.The section is organized as follows:First,several popular
parallel computing models are brieﬂy described.Second,GPGPU is introduced and
the most important surveys and performance studies are summarized.Third,a survey
of GPU implementations is made,with respect to image processing algorithms.In the
ﬁnal subsection,we focus on implementations dealing with large images.
2.1 Parallel Computing Architectures and Models
2.1.1 NVidia CUDA
As the number of papers published shows,one of the most widely used parallel pro-
gramming models is CUDA,developed by NVidia.It is very easy to use for program-
mers,since it introduces a small number of extensions to C language,in order to pro-
vide parallel execution.Another important features are ﬂexibility of data structures,
explicit access on the different physical memory levels of the GPU,and a good frame-
work for programmers including a compiler,CUDASoftware Development Kit (CUDA
SDK),a debugger,a proﬁler,and CUFFT and CUBLAS scientiﬁc libraries [CUD10e]
From the hardware point of view,NVidia graphics card architecture consists of
a number of so-called streaming multiprocessors (SM).Each one includes 8 shader
processor (SP) cores,a local memory shared by all SP,16384 registers,and fast ALU
units for hardware acceleration of trancendental functions.A global memory is shared
by all SMs and provides capacity up to 4 GB and memory bandwidth up to 144 GB/s
(to July 2010).FERMI architecture introduces newSMs equipped with 32 SPs and 32768
registers,improved ALU units for fast double precission ﬂoating point performance,
and L1 cache.
From the programmer’s point of view,every program consists of two parts,a host
code ran on CPU (called a ’host’),and a kernel—a piece of code ran in parallel on
GPU (called a ’device’).CUDA’s extensions to the C programming language allow
programmer to deﬁne how many threads performing a kernel will be executed on
a device.The number of threads can greatly (up to 10,000) exceed the number of
processing units which execute the program.This paradigm allows programmers to
consider GPU as an abstract device providing a huge number of virtual resources.
Therefore,not only it provides a good scalability for future hardware,it also allows to
hide a global memory latency by very fast thread switching [CUD10c].
The threads executed on GPU form an 1D,2-D,or 3-D structure,called a block
(Figure 3).The blocks are arranged in 1D or 2-D layout,called a grid.Thus,each
thread can be exclusively deﬁned by its coordinates within a block,and by coordinates
of its block within a grid.Each block of threads is executed on a sigle SM;therefore,
all threads of a single block can share data in a shared memory.Block’s threads are
executed in warps of 32 threads each.
A memory hierarchy is introduced in a number of levels (Figure 4).In the lowest
level,there are small,fast,non-shared registers.Then,each SM contains a shared
PDFill PDF Editor with Free Writer and Tools
The number of threads per block and the number of blocks per grid specified in the
syntax can be of type
dimensional blocks or grids can
be specified as in the example above.
Each block within the grid can be
by a one
accessible within the kernel through the built
variable. The dimension of the t
hread block is accessible within the kernel through
Extending the previous
example to handle multiple blocks, the code
becomes as follows.
DA Programming Guide Version 3.0
// Kernel definition
(i < N && j < N)
C[i][j] = A[i][j] + B[i][j];
Grid of Thread Blocks
2: Programming Model
Figure 3:Threads hierarchy in CUDA programming model [CUD10c].
2: Programming Model
DA Programming Guide Version 3.0
As illustrated by
assumes that the
CUDA threads execute on a physically separate
that operates as a coprocessor
running the C program.
This is the case, for examp
execute on a GPU and
rest of the
C program executes on a CPU.
Figure 4:Memory hierarchy in CUDA programming model [CUD10c].
memory,as mentioned above.This memory is also fast and small;it can provide
communication between threads in a block.Another important memory space is a
constant memory,which is read-only.Since it is cached,it usually serves as a fast
storage place for function parameters and constants.A texture memory is read-only
and cached,as well,but it also includes several useful in-hardware features,such as a
number of addressing modes,interpolation,and locality.In the highest level,there is
a large but long-latency global memory space.
2.1.2 ATI Stream
ATI Stream is a parallel GPU architecure,designed for GPGPU computations,for-
merly developed by ATI company,which was acquired by AMD in 2006.The series
of graphics cards are branded as ATI Radeon,ATI FirePro,and AMD FireStream.
Also,low-power cards for mobile devices are developed,such as ATI Mobility Radeon.
Hardware implementation of stream processing units is architecturally different from
NVidia’s stream processors,allowing a high number of stream processing units per
core,L1 cache,and increased double precision performance.However,the cards offer
lower memory bandwidth and processor clocking,so the overall performance is com-
parable with the competitor’s.The GPGPU developers are supplied by ATI Stream
SDK,allowing to develop applications in the high-level OpenCL language,described
bellow [ATI10b].Figure 5 illustrates the relationship of the computing components in
the ATI Streamarchitecture.
OpenCL is a framework for heteregenous platforms,consisting of CPU,GPUand other
architectures.It is an open standard developed by Khronos Group [Ope10a].It in-
cludes a C-based cross-platform language,a platform layer API,a runtime API,and
scientiﬁc libraries.In a context of GPU programming,it also supports an interoper-
ability with the OpenGL standard.Analogous to CUDA,basic execution units are
called kernels.They can be either data-parallel,allowing a ﬁne-grained paralleliza-
tion suitable for GPUs,or task-parallel,providing a coarse-grained parallelization for
CPUs.A memory model is deﬁned in levels of private,local,constant and global
memory space [How10] (Figure 6).OpenCL is supported by the both mainstream
GPU vendors,however studies,such as by Kong et al.[KDY
10],revealed that with
NVidia graphics cards,CUDA implementations generally achieve better performance
than their OpenCL equivalents.
2.1.4 Other architectures
Although graphics cards are probably the most widely used devices nowadays,other
hardware architectures can be exploited for massive parallel computing as well.Be-
sides large expensive CPU clusters,a ﬁeld-programmable gate array (FPGA) are pop-
ular for their conﬁgurability and compact one-chip design,allowing them to be em-
ployed in low-power autonomous devices [BR96].Since FPGAarchitecure is much less
accesible than GPU and requires low-level"close-to-metal"approach,we have decided
Figure 5:ATI Streamsoftware enviroment [ATI10a].
Figure 6:Memory hierarchy in OpenCL programming model [ATI10a].
not to deal with FPGAs in our thesis topic.However,it can be mentioned that FPGAs
can be also used in image processing [LDZZ09] [YNAS09].Also,a research in adapt-
ing CUDA programming model into FPGA is conducted [PGS
09],building a bridge
between GPGPU development and FPGA design.
2.2 General-Purpose Computation on GPU
GPGPUis the technique of mapping general-purpose algorithms to graphics hardware,
exploiting its high computation and data throughput.The performance of graph-
ics cards was employed in many applications,including trafﬁc simulation [CBM
molecular dynamics [Joh07],seismic modelling [KEGM10],n-body simulation [Nyl07],
and image processing [CDMS
10] (Figure 7).For further information,the
website GPGPU.org [GPG] can be recommended as an up-to-date source for both de-
velopers and researchers,including notiﬁcations to recent papers.The OpenVIDIA
framework [FR] is another project focused on the topic,specializing in OpenGL,Cg
and CUDA implementations.
One of the ﬁrst methodical surveys made on GPGPU was made by Owens et al.
05].Since published in 2005,it did not cover recent research based on popular
high-level programming languages,such as CUDA or OpenCL.Not all the informa-
tion,however,became out-of-date,as some paradigms in parallel computing remained
unchanged.A streaming model,as an abstraction for GPU programming,was con-
sidered in the survey.Several fundamental operations were introduced,such as map,
reduce,scatter and gather,streamﬁltering,sort and search.
A map operation applies a given function to every element in a given data stream.
Computing a smaller output data stream from a larger input stream is called a re-
duction.Scatter and gather are two fundamental memory operations,which represent
indirect read and write memory acceses.Stream ﬁltering provides an ability to select
a subset of elements from a stream.Sort and search,two well-known operations,need
very careful attention to be implemented efﬁciently on GPU.
The survey also discussed several data structures that can be used in GPGPU,un-
covering some constraints imposed by GPUfor non-indexable structures,such as prior-
ity queues,sets,and k-d trees.Many results of GPU acceleration were cited,showing
a good effort that had been already made on GPGPU using high-level shading lan-
guages.A revised version of the survey was published in 2007 [OLG
Many performance studies comparing GPGPU and traditional CPU were pub-
lished.A paper by Gong et al.[GLG05] provided a comparison between CPU (In-
tel Pentium 4) and two commercially-succesful GPU platforms (ATI Radeon 9800 and
NVidia GeForce 5950),in a number of real-time image processing applications.The au-
thors used three different shading languages (HLSL,Cg and GLSL) and showed that
both the GPU platforms are competitive,showing signiﬁcant speedups (up to 36 in
the case of ATI and up to 22 in the case of NVidia) over CPU.
A study by Che et al.[CBM
08] examined several common,computationally de-
manding applications,including trafﬁc simulation,thermal simulation,and k-means
algorithm.The speedup achieved by GPU implementations using CUDA was between
6 and 40 ,compared to CPU implementations,using a common commercially-
(a) Fast and stable rendering of water surfaces
(b) A brain cortex segmentation [CLW04].
(c) Calculating electrostatic potential maps in
molecular dynamics [Joh07].
(d) Feature extraction [PSL
(e) A simulation of a galaxy as an N-body system
(f) A 3D reconstruction of a mouse placenta
Figure 7:Applications in GPGPU.
Ryoo et al.[RRB
08] discussed NVidia GPU’s key features and optimization strate-
gies.They studied GPU kernels,implemented in CUDA,for dozen various applica-
tions,including H.264 video encoder,Lattice-Boltzman method,and MRI image re-
construction.They claimed tremendous speedups comparing NVidia GeForce 8800
GTX to 2.2 GHz Opteron 248 (a single-core CPU),even using SIMD optimizations for
CPU implementations.For example,MRI reconstruction algorithms achieved speedup
rates over 100 ,including data transfer overhead.The most of the kernels attained
speedups over 10 ,actually only two of themperformed similarly on CPU and GPU,
in particular,H.264 video encoder and FDTD electromagnetic wave propagation sim-
ulator.The basic optimization principles described in the paper are maximizing the
ratio of ﬂoating-point operations,hiding global memory latency by high streaming
processor’s occupancy,and balancing the usage of global memory,shared memory,
and registers,in order to achieve optimal memory bandwidth.
Another survey of many different CUDA implementations was published by Gar-
land et al.[GLGN
08].The topics of interest,in this case,were molecular dynam-
ics,numerical linear algebra,medical imaging,ﬂuid dynamics,and seismic imaging.
The most impressive results were attained in molecular dynamics.In the former ap-
plication,parallelization was naturally provided by mapping each atom to a single
thread.A CUDA application,called Highly Optimized Object-Oriented Molecular Dy-
namics (HOOMD) was introduced and compared to another parallel software,called
LAMMPS,showing that the former one executed on a single GeForce 8800 GPU per-
formed similarly to the latter one executed on a cluster of 32 CPU cores.Another com-
parison was made using a popular application Folding@Home;in this case,GeForce
8800 outperformed a dual-core CPU by rate of 100 .However,the results in other
applications were not that impressive.For example,in linear algebra,a SGEMMdense
matrix-matrix multiplication kernel by Volkov and Demmel was compared to Intel’s
Math Kernel Library (MKL).Testing the graphics card mentioned above,and Intel
Core2 Quad processor,the GPU implementation performed 3 faster.The same re-
sults were achieved for matrix factorization problem,in particular LU,Cholesky,and
In contrast with many previous studies,a recent paper by Lee et al.from Intel
10] shows that with a careful optimizing of CPU implementation,
still the CPU can be succesful in many applications.They perform a comparison of
14 various implementations,showing speedups from 0.5 to 15 (GPU over CPU)
which is signiﬁcantly less than in many previous studies.Their experiment was made
with Intel Core i7 and NVidia GTX280.
The paper presents not only results of comparison,but also examination of the
results,emphasising pros and cons of both architectures.The GPU beneﬁts from
higher global memory bandwidth and ﬂoating-point operation performance,hardware
scatter-gather support,acceleration of transcendental (e.g.exponential and trigonomet-
ric) functions,and fast texture lookup capability.As a result,kernels,such as bilateral
ﬁlter,GJKcollision detector,SAXPY (basic linear algebra task ax +y),and Lattice Boltz-
mann method are accelerated by rate of 5 and higher on GPU.On the other hand,the
weaknesses of GPU are non-cached global memory (which does not stand for a new
FERMI architecture),reduction overhead,and synchronization overhead.Thus,radix
sort and constraint solver perform better on CPU than on GPU;and kernels,such as
Monte Carlo or tree index search,are performed on both architectures similarly.
It can be pointed out that the results should slightly change using more recent
graphics hardware,since NVidia has already released FERMI architecture by that time.
Also,more impressive results can be achieved using multiple graphics cards in paral-
lel.Nevertheless,it has been shown that the programmers indeed should be careful
judging their CPU and GPU implementations;and that many tremendous speedups of
100 and more,declared in GPGPU,should be treated with caution.
2.3 Image Processing Acceleration
Image processing is a wide ﬁeld,including many various applications.It involves
many speciﬁc algorithms.Among others,these are Fourier,Hough,Wavelet,and
Radon transform [GW02] [SM06],matrix multiplication,linear equation solvers,and
principal component analysis [Pra01] [Jäh05],partial differential equations (PDE) solvers
[Kim05],sorting and searching algorithms [Pra01] [GW02],statistical methods [Fra04],
morphological operations [Ser83] [Mar96],and graph algorithms.Many of these meth-
ods are computational intensive and suitable for parallelization [BFRR01].
Castaño-Díez et al.[CDMS
08] presented a detailed performance evaluation of
image processing algorithms,with respect to their GPU implementations.They tested
a number of particular functions frequently used in image processing,such as a fast
Fourier transform,a matrix multiplication,and several linear and non-linear ﬁlters.
They also implemented more complex image processing algorithms,such as weighted
back projection used for 3-D image reconstruction,and principal component analysis
used for classiﬁcation of large data sets.The GPU implementations were written in
CUDA and Cg shading language and claimed speedups rates of at least 10 in the
most cases,comparing Intel Dual Core CPU and NVidia Quadro FX5600.
Several GPU computing toolboxes for MATLAB were developed,such as Jacket
[AE10] and Parallel Colt [WN].Kong et al.[KDY
10] developped an open-source
GPU library for the MATLAB Image Processing Toolbox.Mapping 12 various func-
tions from IPT on GPU,using both CUDA and OpenCL parallel models,they studied
which algorithm and hardware features inﬂuence the overall performance.Based on
algorithm characteristics,they divided the functions into four categories:data inde-
pendent (e.g.image intensity adjustment),data sharing (edge detection),algorithm
dependent (Euclidean distance transform),and data dependent (dither).The functions
of the ﬁrst category map naturally on a GPU and achieve high kernel performance.
They are usually limited by memory bandwidth and data transfer overhead.The data
sharing algorithms are excellent candidates for massive parallelization.For each im-
age element,the information about its neighbours is needed,therefore a careful opti-
mization is required in order to hide global memory latency,by means of eliminating
data reuse.With such optimization strategies,tremendous speedups can be achieved.
The algorithm-dependent functions often require to be fundamentally revised,with
respect to architecture differences.Still,they are good aspirants for GPU acceleration.
The dither function,falling into the last category,performed poorly on GPU,due to
large fraction of sequential processing,large synchronization overhead,and low ratio
of computations to memory accesses.
The imaging process can be divided into several phases.First,an image is acquired
by a device,digitized,and reconstructed.This is called image formation.Then,a pre-
processing is made,applying various image enhancement techniques.This includes
tone mapping,image restoration,and geometrical methods.Finally,image analysis is
performed,extracting desired information fromthe image,by means of morphological
processing,edge detection,image segmentation,feature extraction,shape analysis,
and image registration [Pra01].In the following text,we will discuss particular image
processing methods and their GPU implementations.
2.4 Image Formation
Basically,most of imaging techniques use projection from 3-D space (real world) into
2-D space (imaging device).In many applications,these two-dimensional images are
used for further processing,without special requirements for image formation.In
other applications,such as magnetic resonance imaging (MRI),computed tomography
(CT),and 4Pi confocal microscopy [CC74] [HS92],a number of acquired images is used
for reconstructing a complex 3-D image.This leads to advanced approaches to image
In [XM05],Xu and Mueller implemented three 3D reconstruction algorithms from
CT,namely Feldkamp ﬁltered backprojection,simultaneous algebraic reconstruction
(SART),and expectation maximization (EM),on graphics hardware.From 80 projec-
tions,they obtained reconstructions of 128
pixels.The quality of GPU results was
comparable to those obtained from CPU,achieving approximately 10 speedup.In
[XM09],Xu and Mueller introduced GPU implementation of the OS-SIRT iterative re-
construction algorithm (Figure 8).Furthermore,they implemented bilateral ﬁlters,for
efﬁcient regularization of reconstructed results.For the latter operation,they achieved
tremendous speedups of more than 100 ,using OpenGL Shading Language with
NVidia 8800 GT graphics card.They approved the OS-SIRT method to outperform
SART for data with high level of noise.
Castaño-Díez et al.[CDMS
08] presented three different implementations of SART,
two of them using CUDA,and one with a texture-based approach using Cg.All the
implementations achieved approximately 50 speedup,compared to their CPUequiv-
alents.The Cg implementation performed the best,however,writing and debugging
it required considerably higher effort than in CUDA,as noted by the authors.The
GPU-accelerated weighted back projection,used for 3D reconstruction from 2-D elec-
tron micrographs,was tested,as well.This included low-pass ﬁltering of the acquired
images,rotation,Fourier space weighting,and backprojection into a 3D image.They
attained speedups up to 10 on graphics hardware.
Stone et al.[SHT
08] accelerated advanced MRI reconstruction algorithm,with
respect to optimal reconstruction from non-Cartesian scan trajectories.They imple-
mented anatomically constrained reconstruction algorithm by Haldar et al.[Hal08].
The key part of the algorithm was the conjugate gradient linear solver,designed to
perform an iterative matrix inversion.As the number of elements in matrices was
equal to the square of voxels in the reconstructed image,the matrices contained tril-
Figure 8:The iterative tomographic reconstruction of medical images with bilateral
lions of complex-valued elements for images of 128
had either a sparse,or a convolutional structure,allowing programmers to handle
with them in a compressed format,or in a Fourier domain,respectively.The authors
tuned their algorithm,comparing seven versions with various optimizations.Their
GPU best implementation was able to compute the MRI reconstruction within less
than two minutes,while the CPU implementation completed in 23 minutes.Using
four NVidia Quadro FX 5600 graphics cards,a multi-GPU implementation was tested
as well,reducing the overall time to less than a minute.
In ultrasound imaging,inverse-scattering algorithms are used for generating 3D
volumetric images of a breast.These methods are time consuming,however,specialists
would like to discuss results with their patients within the same visit.In Garland et al.
08],GPU implementations of particular procedures used in these algorithms
were discussed,especially 2D convolution by fast Fourier transformand CAXPBY rou-
tine,which computes several basic operations from linear algebra.In the algorithm’s
run,over 50 million 2D FFTs needed to be computed,accounting for over 75 percent
of the computation time.Implementing a batched 2D FFT kernel,Garland et al.made
the GPU convolution approximately 16 times faster than the CPU implementation.
An iterative tomographic reconstruction was studied by Jang et al.[JKDP09].Both
forward and backward projection steps were implemented using CUDA on NVidia
Tesla S870.This graphics hardware consists of four independent Tesla C870 GPUs,
therefore,a multi-GPU approach was examined.For the forward projection,they
Figure 9:Examples of HDR images processed by tone mapping [GWWH05].
achieved 35 speedup with a single GPU and 70 speedup with all four GPU cores,
compared to Intel Core 2 Duo CPU.For the back projection,GPU performed even
better.The multi-GPU performance was signiﬁcantly degraded due to CPU thread
synchronization overhead,as each GPU core had to be driven by an individual CPU
thread.However,this could be caused by the operating system’s thread scheduling
policy,as stated by the authors.Using the single ﬂoating-point precision,the results
showed negligible difference between the CPU and GPU reconstructions.
2.5 Image Enhancement
2.5.1 Tone Mapping
Tone mapping is the process of mapping pixel intensity values with high dynamic
range (HDR) to a smaller dynamic range for the purposes of visualization.As recent
display technologies still provide limiting dynamic range,tone mapping is an impor-
tant part of image visualization.It also includes brightness and contrast enhancement.
In this context,simple histogram-based adjustments can be also considered a part of
tone mapping.As these methods involve a large number of pointwise operations,
they can be parallelized on GPU very naturally;however,the speedup may be limited
by global knowledge required (see Introduction) in a particular algorithm [BFRR01]
Drago et al.[DMAC03] designed a fast HDR tone-mapping algorithmand achieved
a modest speedup,comparing C implementation on Pentium 4,and OpenGL imple-
mentation on ATI Fire GL X1.They tested their algorithm on common real images,
such as a photograph of the Standford memorial church.Another method,tested
on digital photographs,was published by Goodnight et al.[GWWH05] (Figure 9).
They implemented well-known algorithm of Reinhard et al.[RSSF02] and declared
approximately 10 speedup using an OpenGL implementation,using AMD Athlon
1800+ CPU with ATI Radeon 9800 Pro graphics card.Raspe and Müller presented
an interactive tone-mapping framework for medical volume data [RM07].They built
the cross-platformprogramming enviroment called"Cascada"using OpenGL Shading
Language (GLSL).Their results revealed tremendous speeup factors up to 250 ,tested
on Intel Core 2 Duo and NVidia GeForce 8800 GTS.However,the performance was sig-
niﬁcantly degraded by data transfer overhead.Therefore,the algorithms worked best
when avoiding data transfers.For example,user interface for tuning parameters of
tone mapping maintained real-time performance.
2.5.2 Image Restoration
Image restoration aims to remove undesired artifacts in images,including blurring
introduced by an optical system,a motion blur,and a noise from electronic sources.
Thus,it is often an integral part of image formation.Removing blur leads to decon-
volution techniques;eliminating noise usually involves linear and non-linear ﬁlters.
However,for real images,use of advanced approaches is inevitable,taking both degra-
dation features into account.
Gong et al.[GLG05] implemented a number of simple image ﬁlters,including
mean,Gaussian and median ﬁlter,using the HLSL shading language.They compared
two graphics cards (ATI Radeon 9800 XT and NVidia GeForce FX 5950 Ultra) with the
Intel Pentium 4 processor.For linear ﬁlters,they achieved approximately 4 speedup
with the ATI GPU and almost no speedup with the NVidia GPU.For nonlinear ﬁlters,
such as the median,the ATI card attained the same processing times as the CPU,and
the NVidia card was 6 slower.It was shown that nonlinear ﬁlters indeed perform
worse than linear ﬁlters,as they involve many comparing operations,which are not
very suitable for paralelization of graphics hardware.In this work,only very small
3 3 ﬁlter kernels were discussed.Much larger kernels would probably give more
precise results,eliminating the effect of overheads.
In Castaño-Díez et al.[CDMS
08],both linear and median ﬁlters were accelerated,
showing good performance,especially for large 3D images.For small images,such as
128 128 pixels,only a modest speedup was attained.This could be expected,due to
some overheads,however,the authors did not comment this issue in detail.Kong et
10] implemented MATLAB function conv2,which provide a convolution of
2D images.With image sizes of over million pixels,the achieved kernel speedups were
in order of hundreds,however,the overall speedups were approximately 10 ,taking
into account data transfers.A number of linear ﬁlter implementations can be found in
CUDA SDK [CUD10d].
Among advanced restoration methods,Richardson-Lucy deconvolution algorithm
[RIC72] [Luc74] is one of the best known and widely used.Domanski et al.[DVW09]
compared a CPU implementation,using the FFTW library [FJ09],and a GPU im-
plementation,using the CUDA framework and the CUFFT library [CUD10b].Two
approaches were discussed,deconvolution in the spatial domain and in the Fourier
domain.Even for relatively small sizes of a ﬁlter kernel,the FFT-based approach out-
performed the spatial-domain approach.Comparing the CPU and the GPU platforms,
i.e.the NVidia GTX 260 graphics card and the Intel Xeon Quad-Core processor,the
GPU implementation was 8 faster for 2D images and 4 faster for 3D images.
Quammen et al.[QFTI09] presented a performance study of three 3Ddeconvolution
(a) Original image
(b) Blurred and noisy image
(c) Reconstructed image
Figure 10:Reconstruction of an astronomical image [SZZ10].
algorithms,namely Wiener [Wie64],Jansson-van Cittert [SN06],and Richardson-Lucy,
on both multi-core CPUs and GPUs.As the FFT-based approach was preferred,both
the FFTW and the CUFFT libraries were employed.The scalability of algorithms on
CPU was studied,measuring performance on various numbers of CPU cores,ranging
from 1 to 16.The results revealed that the CPU platform,consisting of 8 dual-core
AMD Opteron nodes connected by a Hyper-Transport network,performed best in the
case of 8 CPU cores.For a larger number of cores,a sharp drop-off in scalability was
revealed,probably caused by some implementation features in FFTW library.Then,
the 8-core CPU and a single-core GPU,NVidia Quadro FX 5600 in particular,were
compared.The CPU performed 2 faster for the Wiener ﬁlter,and approximately
1.5 faster for the other ﬁlters.As the time complexity of all the algorithms is strongly
depended on FFT performance,other FFT libraries for GPUshould be considered,such
as [NOEM08] and [GLD
08].For further information,refer to section 2.6.6.
Expectation Maximization (EM) method,as described in [SV82],Scaled gradient
projection (SGP) [BZZ09] and Total Variation (TV) regularization [ROF92] are examples
of optimization techniques,used for edge-preserving removal of noise.Seraﬁni et al.
[SZZ10] proposed a GPU version of both SGP and EM using the CUDA framework,
and the CUFFT [CUD10b] and CUBLAS [CUD10a] libraries (Figure 10).They used
double precision in the CPU implementation,tested on AMD Athlon X2 Dual-Core,
and single precision in the GPU implementation,ran on NVidia GTX 280.According
to their results,SGP largely outperformed EM in means of the time consumption,
obtaining the same image quality.The speedup achieved by GPU was approximately
20–30 for both methods.Moazeni et al.[MBS09] described a GPU implementation
of TV regularization for denoising of diffusion tensor images (DTI),used in MRI.Their
algorithmrequired less than 2 minutes to process a 3-Dimage of 128
voxels on NVidia
Quadro FX 5600.The same algorithmneeded more than 3 hours to be completed on a
2.5.3 Geometrical Methods
Geometrical modiﬁcation of images is very common image processing operation,in-
cluding basic transformations,such as translation,rotation,and scaling;and advanced
procedures,such as nonlinear warping and perspective transformation.These algo-
rithms are very naturally accelerated on graphics hardware,since they actually repre-
sent one of the most typical tasks for GPUs.Geometrical methods are often an integral
part of image formation and image registration,such as in [SKSF07] and [MOOXS08].
For further information on the topics,refer to sections 2.4 and 2.6.6.
In one of early works,Ren et al.[RPZ02] described hardware acceleration of the
Elliptical weighted average (EWA) surface splatting method,used for high-quality ren-
dering of point-sampled 3-Dobjects.In the paper,an object space formulation was pre-
sented,and a GPU implementation was developed,using DirectX-supported graphics
hardware.Kelly and Kokaram [KK04] proposed an algorithm for motion estimation
in video sequences.The key part of the algorithm was fast image interpolation,us-
ing textures on graphics hardware.Fan et al.[FEK
05] described an interpolation
method called natural neighbor interpolation,based on Voronoi tessellation.Strengert
et al.[SKE06] applied bilinear texture interpolation,implemented on modern graphics
hardware,to accelerate pyramid methods,which are popular especially in fast im-
age processing.Kraus et al.[KES07] proposed the edge-directed image interpolation
method for image up-sampling,which avoided ringing artifacts,and preserved smooth
and sharp edges.The graphics hardware was exploited for the computation,using
OpenGL.Zhu et al.[ZLB
09] proposed Spherical Depth Image,a spherical represen-
tation of image-based models,adopted to image warping.A real-time performance
was achieved,implementing the algorithmon a graphics hardware.
Terrain and surface rendering is a typical application for geometrical methods.
Botsch and Kobbelt [BK03] proposed a fast framework for point-based geometry.The
algorithmwas implemented on the NVidia GeForceFX 5800 Ultra graphics card,using
OpenGL.The splat ﬁltering was used for smoothing textures,achieving more eye-
appealing results.Dachsbacher and Stamminger [DS04] presented a view dependent
level-of-detail approach for interactive terrain rendering in real-time.During render-
ing,procedural geometric and texture detail adding was accelerated by graphics hard-
ware.Schneider et al.[SBW06] described a method for creation and rendering of
realistic landscapes.The real-time fractal terrain synthesizer,combined with WYSI-
WYG user interface,was built and implemented on NVidia GeForce 7800 GTX using
Levin et al.[LDS05] implemented the thin plate spline (TPS) warping algorithm
[Duc76] using GLSL.Warping was applied on CT-based volumes of 512512 173
voxels,using 92 user-deﬁned displacement vectors.Xiao et al.[XCZ09] proposed a
bilinear warping method for 2D CT reconstruction,accelerated on graphics hardware.
Using OpenGL and a HP W4000 PC workstation having 3DLabs Wildcat 6110 graphic
card,the GPUimplementation achieved 20 speedup for forward-projection stage and
15 speedup for whole reconstruction.
2.6 Image Analysis
2.6.1 Morphological Processing
Mathematical morphology is based on analysis and processing spatial structures in
images.The basic concepts of this theory trace back to set theory,integral geometry
and lattice algebra.Morphological methods can be used both for image pre-processing
and for image analysis.For example,opening and closing can be adopted for smooth-
ing contours of objects or eliminating small holes in objects.In the other application,
image skeletonizing can be used to describe structure of objects.Watershed transform
is one of the well known approaches to image segmentation.For information about
GPU-based watershed implementations,refer to section 2.6.3.
Dilation and erosion are two fundamental morphological operations.Due to their
nature,they can be very simply implemented on graphics hardware,as presented in
[HB08] or [KDY
10].Another GPU implementation of dilation,erosion,and closing
can be found in [ESA05].In this paper,application of morphological methods to pre-
processing in medical imaging was studied.Sugano and Miyamoto [SM08] proposed
parallel implementation of several morphological operations,however,they did not
use graphics hardware,but the high-performance Cell Broadband Engine processor.
Dortmont et al.[vDvdWT06] implemented a method for computing distance trans-
forms and skeletons of 3D objects on graphics hardware.Rong and Tan [RT06] pro-
posed a fast GPU-based jump ﬂooding algorithmto compute an approximations of the
Voronoi diagram and the distance transform.Cao et al.[CTMT10] described a CUDA
implementation of a parallel banding algorithmfor computing the exact Euclidean dis-
tance transform (EDT),which can be further used for performing such morphological
operations as dilation,erosion,opening,and closing.
2.6.2 Edge Detection
Edges,i.e.changes or discontinuities in image luminance or color values,represent
important regions in image.According to their character,edges are often detected by
derivative-based operators,for example,ﬁrst-order derivative methods (based on gra-
dient or directional derivatives) and second-order derivative methods (e.g.Laplacian
operator).Edge detectors can be often implemented by means of convolution,either
by a basic convolution,or by a separable convolution,as many edge-detecting ﬁlters
have a separable kernel.In the case of large unseparable kernels,a convolution in
Fourier domain can be performed,as described in [F
C06].CUDA implementation of
convolution methods was described by Podlozhnyuk [Pod07].Other approaches to
edge detection are based on mathematical morphology [YQ08] or on active contour
models [CV01].An example GPU implementation of the latter can be found in [SA09].
Canny is one of widely used edge detectors,based on ﬁrst-order derivatives.Asim-
ple CUDA implementation of Canny operator is described in [Fun05].OpenGL and
Cg versions were presented by Ruiz et al.[RUG07].A more detailed study on Canny
operator was presented by Luo and Duraiswami [LD08].The algorithm was imple-
mented in CUDA and subsequently integrated in MATLAB,achieving approximately
3 speedup over an optimized CPU implementation in the OpenCV library,tested on
Intel Core 2 Duo 6600 and NVidia GeForce 8800 GTX.Kong et al.[KDY
GPU-accelerated procedure for MATLAB,performing the Prewitt edge detector.They
declared more than 1000 kernel speedup over the edge function implemented in the
MATLAB IPT,without taking data transfers into account.
Edge-detecting algorithms can be also included in image segmentation or feature
extraction methods.For further information,please refer to sections 2.6.3 and 2.6.4.
Figure 11:3D segmentation of ﬁshes in acoustic underwater images [SA09].
2.6.3 Image Segmentation
Image segmentation aims to recognize objects in images.There are many approaches to
image segmentation,namely pixel-based (e.g.thresholding),edge-based (derivative-
based approaches),region-based (split-and-merge and region-growing),and model-
based (Hough transform).Segmentation by global optimization leads to more complex
regularization techniques [Jäh05].Mathematical morphology introduces a speciﬁc cat-
egory of segmentation algorithms,for example,a watershed transform[BM91].
Many examples of early work on GPU-based segmentation methods can be found
in comprehensive studies by Hadwiger et al.[HLSB04] and Owens et al.[OLG
We will brieﬂy mention some of them.Diewald et al.[DPRS01] presented an early
GPU implementation of anisotropic diffusion models.Sherbondy et al.[SHN03] de-
scribed GPU implementation of isotropic diffusion based on the Perona-Malik model,
with an application to 3-D volume medical data.Lefohn et al.[LCW03] used a stream-
ing approach on a GPU to accelerate a level set method.They proposed an efﬁcient
memory management scheme using 2-Dtextures;and achieved 15 speedup.Further-
more,in [CLW04],they presented a new tool,called GIST,for interactive GPU-based
Sharma and Anton [SA09] proposed a level set method for 3D reconstruction of
acoustic underwater images (Figure 11).Their approach was based on level set evolu-
tion using Mumford-Shah segmentation functional,with a special view to noise sup-
presion.They implemented a number of procedures in CUDA,including signed dis-
tance transform,weighted median ﬁlter,and PDE solver.As the acoustic images from
marine surveys can often result in gigabytes of data,the algorithm was optimized in
means of memory usage.Due to large resolution,the images were divided into small
volumes of size 128
voxels,stored into 3D textures,and processed independently.
They were able to complete the computation in one minute,using the mobile GPU,
namely NVidia GeForce 8600MGT.
An interesting application of segmentation can be found in Zitnick et al.[ZKU
In this paper,a method for interactive control of viewpoint in video sequences was pro-
posed,combining synchronized streams frommultiple cameras.As a result,user could
replay video from a new viewpoint,continously change the point of view in dramas,
and create"freeze frame"visual effects.Their segmentation method was based on color
matching.For more convenient results,a variant of Perona and Malik anisotropic diffu-
sion was used to reduce noise in images.Their GPU implementation showed real-time
performance for data stored in the graphics hardware’s memory.
The watershed transformviews a greyscale image as a relief ﬂooded by water.Once
the particular ﬂooded regions,called catchment basins,merge,the lines of merging are
marked as watershed lines,i.e.the segmentation lines.The major issue with the stan-
dard watershed is over-segmentation.Therefore,several advanced approaches were
designed to address the problem.As the watershed is a global operation,it is fairly
challenging to be implemented on graphics hardware.Several implementations can
be found,such as [PGX08] and [KP08].In the former work,the multi-degree immers-
ing watershed algorithm was written in CUDA.Comparing NVidia GeForce 8500 GT
graphics card with Pentium4,more than a double speedup was achieved.In the latter
work,a cellular automaton was designed to implement a number of graph algorithms
to perform the watershed transform.Compared with the toboggan-based watershed
implemented on CPU,the GPU implementation attained more than a double speedup,
as well,using Intel Xeon and ATI Radeon X1950 Pro.
Graph algorithms can be used not only in the watershed transform,but also in
graph-cuts approaches.These methods are succesful competitors to level sets.Graph
cuts can be used e.g.for Mumford-Shah functional computation [GA09].They can
perform fast,however,they require a lot of memory.Therefore,they are difﬁcult to be
applied on large images,especially when implemented on GPU.CUDA implementa-
tion,achieving a 10 speedup over a CPU implementation,can be found in [VN08].
Harish et al.[HN07] implemented several fundamental graph algorithms in CUDA,
including breadth ﬁrst search,single source shortest path,and all-pairs shortest path.
They achieved a signiﬁcant speedup (up to 50 ) for artiﬁcial graphs,however,for
real-world graphs with small per-vertex degree,the performance of GPU was compa-
rable to CPU.Barnat et al.[BBB
C10] proposed a CUDAimplementation for computing
strongly connected components of a directed graph.
2.6.4 Feature Extraction
An image feature is a representative characteristic or attribute of an image.The goal
of feature extraction is to transform input data,that can be very large,into a reduced
set of information,which is used for consequential analysis and classiﬁcation.Ampli-
tude features are used for extracting information,contained in image amplitude,i.e.
luminance,spectral value,or other units.Measures describing an image histogram,
such as mean,standard deviation,skewness,or energy,often aid in amplitude fea-
ture extraction.Image transforms,for example Fourier,Haar,and Hough transforms,
can be employed to detect shape-based features.Texture-based approaches include
edge detectors,autocorrelation methods,Gabor ﬁlters,dependency matrix methods,
singular-value decomposition (SVD),and many others [Pra01].For the classiﬁcation of
image data,principal component analysis (PCA) and related techniques are used.
One of the methods,used not only for feature extraction,but also in other appli-
cations,is the wavelet transform.Hopf and Ertl [HE00] proposed an early OpenGL
implementation of the transform.Measured on a Silicon Graphics Octane with R10000
195MHz processor and a MXE graphics pipe,they achieved a signiﬁcant speedup for
Figure 12:Feature extraction in a sunﬂower image [CVG08].
large images.Garcia and Shen [GS05] described a Cg implementation of 3D wavelet
reconstruction,and evaluated the performance on the Intel Xeon processor and the
NVidia Quadro FX 3400 graphics card.For a visible woman volume dataset,the GPU
achieved 6 speedup for Haar wavelets and 7 speedup for Daubechies.Another
OpenGL implementation of the wavelet transform,applied to wavelet-based geomet-
ric design,stylized image processing,texture-illuminance decoupling,and JPEG2000
image encoding,was proposed by Wong et al.[WLHW07].
Fung et al.[FM05] implemented a number of feature detection methods in their
OpenVIDIAlibrary.The algorithmwas designed to compute unique appearance-based
feature vectors,and to track features in video sequences.A detailed description of fea-
ture vectors computing can be also found in [Fun05].Ruiz et al.[RUG07] accelerated
the Hough transform for edge and circle feature detection,using OpenGL and Cg
shading languages.Castaño-Díez et al.[CDMS
08] accelerated the PCA method on
graphics hardware,using CUDA.As the algorithmincludes a number of linear algebra
operations,such as a covariance matrix computation,the CUBLAS library was used to
implement such procedures.They achieved up to 10 speedup for classiﬁcation of
data fromelectron micrographs.
Cornelis and Van Gool [CVG08] presented a fast CUDA implementation of feature-
detecting method,called Speeded Up Robust Features (SURF) [BTVG06] (Figure 12).
With respect to optimization of data layout in GPU texture memory,a scale-space
pyramid creation was proposed.Feature descriptor matching was performed using
different criteria,such as the sum of squared differences (SSD) or the dot product
(DOT).In the tests,performed on images of 640 480 pixels,the framerate of over 100
fps was attained on NVidia GeForce 8800 GTX and 35 fps on mobile graphics hardware,
namely NVidia Quadro FX 1600M.
Gipp et al.[GMH
09] proposed a GPU-based algorithm for computing the Har-
alick’s texture features for biological applications.An optimized approach to a co-
occurence matrix generation was introduced,in order to save a memory space.The
GPU implementation was written in CUDA and ran on the NVidia GeForce 280 GTX
graphics card,to perform32 faster than an optimized software version,executed on
an Intel Core 2 Quad machine.Ujaldon [Uja09] used CUDA to implement computa-
tion of Zernike moments on NVidia GeForce 8800 GTX,claiming 5 speedup over the
best existing CPU implementation.
2.6.5 Shape Analysis
Shape-analysing techniques are useful tools for classiﬁcation of objects within an im-
age,including both qualitative and quantitative approaches.Many shape properties
can be described,such as topological attributes,size measurements and other geomet-
ric attributes,shape orientation descriptors,spatial moments,and Fourier descriptors
In shape analysis,distance transforms,e.g.the Euclidean distance transform(EDT),
are often employed for measurements of geometric characteristics.A number of GPU
implementation of EDT can be found in literature,for example in [RT06],[vDvdWT06],
and [CTMT10].Coeurjolly [Coe10] described several methods of volumetric analysis,
based on distance transforms.In particular,separable approaches were discussed,in
order to design high-performance parallel algorithms.
Peterson et al.[PDHW10] proposed a particle systemfor triangular meshes,which
can be used in statistical shape analysis applications.Considering a number of optimal-
izations,both naive and tuned CUDAimplementations were described.The optimized
GPU version of the algorithm attained up to 300 speedup over the CPU version,
when evaluated on NVidia GeForce GTX 260 and Intel Core 2 Duo.The method was
tested on both synthetic and real data,demonstrating similar results.
2.6.6 Image Registration
Geometrical alignment of images is called image registration.It can be used either for
a fusion,in the case the images contain the same scene but different information,or
for stitching images together,in order to get one large mosaic picture.Basically,there
are two groups of image registration methods,area-based and feature-based.The ex-
amples of the former category are correlation,mutual information,and optimization
methods,in the latter category there are methods using invariant descriptors,relax-
ation methods,and wavelets [Zit03].
One of the area-based registration methods is based on the gradient ﬂow,where a
constant brightness in the image is assumed.Thus,the motion corresponds to bright-
ness changes and motion vectors can be calculated frompartial space and time deriva-
tives of image brightness.Strzodka et al.[SDR03] [SDR04] implemented the multi-
scale gradient ﬂow method using DirectX9 graphics hardware,in particular,NVidia
Figure 13:Stitched panoramic view of the Durban Coastline [DDdV10].
GeForceFX 5800.They were able to register fully distorted images,with resolution of
513 513 pixels,within 10 seconds.In [SG04],Strzodka et al.presented a system for
real-time motion evaluation and visualization on 640 480 25 Hz image sequences.
Duvenhage et al.[DDdV10] proposed a GPU implementation of the Lucas-Kanade
registration algorithm,using GLSL (Figure 13).Their algorithmwas applied to provide
3-D stabilisation of the ocean-based wide area surveillance system.In their multi-
resolution approach,the image pyramid is created and the registration is performed in
the lowest resolution ﬁrst,and subsequentially reﬁned in higher resolutions.Dealing
with images of 4096 1000 pixels,the implementation used regions of interest (ROI)
of 256 256 pixels or 512 512 pixels for image registration.Using NVidia GeForce
GTX285,a real-time performance (30 frames per second) was achieved.
Classical representatives of the area-based methods are correlation methods,based
on Fourier shift theorem.Thus,they can be efﬁciently computed in a Fourier domain,
using the fast Fourier transform;on the other hand,they can be used only in the case
of simple transformations,such as shift,rotation,and scaling.Acceleration on GPU is
therefore limited by an efﬁcient implementation of FFT algorithm.The CUFFT library
by NVidia provides framework for computing FFT on GPU [CUD10b].However,the
speedup is bounded by memory bandwidth and data transfer overhead.Also,since
the FFT is sensitive to accuracy,and the graphics hardware performs best in single
precision,programmers have to pay attention to error rates.Nukada et al.[NOEM08]
proposed a custom kernel for 3-D FFT,written in CUDA,claiming 3 faster perfor-
mance compared to CUFFT.However,they assumed image sizes to be power of two,
therefore,the performance for general image sizes is unknown.A more general,well-
described implementation was presented by Govindaraju et al.in [GLD
2–4 speedup over CUFFT library.
2.7 Decomposition of Large Images in Image Processing
In certain applications,namely MRI,CT,and confocal microscopy,huge 3-D images
need to be acquired and reconstructed.In particular,this can mean up to gigabytes
of data per a single image.Even with 2-D images,excessively large resolution can be
achieved,such as in Earth observation (satellite or panoramic views) or in pathology
(images of tissues).For such data,image processing is difﬁcult and requires advanced
approaches,with respect to optimizations of memory usage.This is particularly true
for GPU implementations,since graphics hardware have relatively small memory,as
discussed in Introduction.
Sometimes,the solution is to simply buy a card with larger memory.For example,
in [CTMT10],the algorithm is tested on NVidia GeForce GTX 280 with 1 GB of video
memory for 2D images,and on NVidia Tesla C1060 with 4 GB of memory for larger
3D images.It can be also mentioned that the latter card achieves a slightly slower
performance,due to a lower memory bandwidth.However,this is not a versatile
solution,indeed.While the size of memory space on graphics hardware is constantly
increasing,so does the resolution of acquiring devices,introducing a"race"between
image and memory sizes.Furthermore,high-end graphics hardware is not always
affordable for every user.Thus,it is advisable to search for optimizations of image
processing methods,in order to be able to decompose images into smaller parts.
A convolution and deconvolution are typical examples of operations,where a data
decomposition can be performed in a straitghforward,but sub-optimal manner.Due to
convolution properties,the (de)convolved image can be divided into arbitrarily small
parts,but all the sub-images have to be extended with neighbouring pixels of at least
size of the ﬁlter kernel (PSF),as described in [TH78].Thus,a lot of redundant com-
putation needs to be performed,proportionaly to the PSF size.This approach was
succesfully used in [BRHM96],to compute spatially-variant deconvolution of Hubble
Space Telescope images.
In MRI and tomographic recontructions,a division of dataset can be used to per-
form computation on multiple processing units,thereby accelerating time-consuming
08],a multi-GPU implementation of MRI recontruction was
tested on four NVidia Quadro graphics cards.The key part of the algorithmwas time-
consuming computation of excessively large vector.Therefore,a simple data splitting
could be used.Slightly sub-linear speedup was achieved,due to an overhead required
to arrange the data.Jang et al.[JKDP09] proposed an algorithm for iterative tomo-
graphic reconstruction on multiple GPUs,namely on NVidia Tesla S870,which con-
sisted of four GPUs.A detailed description of multi-GPU utilization,however,was not
introduced in the paper.
Another example of decomposition can be found in [SA09].In this paper,Sharma
and Anton proposed a level set method for 3D reconstruction of large acoustic un-
derwater images.With multi-beam echo sounders,a series of 2-D images is acquired,
forming a 3-D volume of nearly 100 million voxels.These data use about 470 MB of
memory along with the same amount of memory for storing the signed distance ﬁeld.
Therefore,the volume was divided into small sub-volumes and kept in 3D textures of
voxels.Then,the chunks of data were processed independently,as described in
the following paragraph.
First,the signed distance transform was computed.However,it could not be im-
plemented in a straightforward manner,since it is a global operation.Instead,a lo-
cal approximation of the Euclidean distance was performed,using the Chamfer dis-
tance.During every single pass,the information about the distance is propagated using
3 33 neighbourhood voxels.Hence,every sub-volume need to be supported with
a one voxel cover from other adjointing sub-volumes,thereby reducing the computa-
tional domain to a volume of 126
pixels.Second,average intensities were computed,
by means of reduction.This does not require an advanced approached to handle the
sub-volumes.Third,the weighted median was calculated,using iterative method.Fi-
nally,the PDE solver was executed,using information froma 5 55 neighbourhood,
thus reducing the computational domain further down to 124
Takizawa and Kobayashi [TK06] proposed a k-means algorithmon a cluster of mul-
tiple computers,supported by graphics hardware.Their method was based on three-
level hierarchical approach to parallel processing.On the top,the divide-and-conquer
approach was employed in two levels.In this manner,large-scale data were divided
into partly-parallel subtasks,each performed by an individual PC.After obtaining the
data,the k-means algorithm was executed in two phases.One part,the nearest neigh-
bour seach,was implemented on GPU;the other one was performed on CPU,since it
involved random access memory writes and conditional branches,that are unappro-
priate for GPU acceleration.The algorithm was evaluated on various conﬁgurations.
The GPU-assisted implementation on a single PC achieved 4 speedup over the CPU-
only version.Further performance gain,attained by 4 computers,was approximately
3 ,showing relatively good scalability on PC clusters.
Hartley et al.[HCR
08] designed a cooperative enviroment for the analysis of
pathology images.The analysis included extraction of four second order statistical fea-
tures,namely contrast,correlation,homogeneity,and energy.For the purpose of tex-
ture analysis,the local binary pattern (LBP) feature was computed.Since one slice
of a tissue can be up to 30 gigabytes of uncompressed data,the computation would
take hours on a single CPU.Therefore,the images were decomposed into tiles of
1000 1000 pixels and processed independently and in parallel on a cluster of 16
nodes.Each node was equipped with two dual-core AMD Opteron 2218 CPUs and
two NVidia Quadro FX 5600 GPUs.Additional six computers were used as reader
nodes in order to provide enough disk bandwidth.The results revealed TIFF decom-
pression stage to be a signiﬁcant bottleneck when two graphics cards per node were
used.Three different image sizes were tested,"small",middle"and"large",with res-
olution of 2.1,4.7,and 8.5 gigapixels,respectively.The total computation times for
the particular image sizes were under 4,7,and 12 seconds,respectively.Hereat,the
analysis on a single CPU would take almost 3 hours for the large image size.
For the parallelization in the work above,the DataCutter middleware framework by
Beynon et al.[Dat01] was used.This enviroment was designed to support processing
of large scientiﬁc datasets stored in archival storage systems across a network.After
having previously developped Active Data Repository (ADR),Beynon et al.proposed
DataCutter to extend and apply the features of ADR,in particular,to provide data pro-
cessing on machines separated from storage servers.In the DataCutter architecture,
clients are supported by two important services,namely indexing and ﬁltering.The
former was proposed to retrieve spatial subsets from large datasets.Since the dataset
sizes could be up to petabytes,a multi-level hierarchical indexing scheme was used,
based on R-trees structure.The latter was employed to performnon-spatial subsetting
and data aggregation.For experimental testing of the DataCutter framework,the Vir-
tual Microscope application was developed,to emulate a high power light microscope.
For detail information refer to [BKC
3 Dissertation Thesis Intent
The main intent of my dissertation thesis will be mapping of image processing algo-
rithms to GPU.I will focus on processing of large 3D images,which we deal with in
the CBIAlaboratory.As the output of the thesis,the methods will be implemented and
included in the i3dalgo library,which is developed by our laboratory.Many parallel
programming models,as known to date,require considerably high effort for the GPU
08].Considering the trade-off between computational speed
and time consumption for software developping,I decided to choose CUDA,for it
provides good framework for programmers and,generally,signiﬁcant speedups of im-
plementations.If there will be time enough,OpenCL implementations will be studied
as well,in order to provide a comparison between the two most popular frameworks.
Since the algorithms will be applied in the light microscopy,the following image pro-
cessing methods,essential for this ﬁeld,will be studied.
Image Restoration.Convolution and deconvolution are two related methods used
for image restoration.The former is employed in CytoPacq,the online framework for
simulating ﬂuorescence microscopy images [SKS09],to imitate a blurring introduced
by an optical system of a microscope.In contrast,the latter is used to reconstruct the
original image fromthe blurred one.
Image Segmentation.Image segmentation is the fundamental part of image pro-
cessing in biomedical application.Therefore,several segmentation algorithms will
be considered,based on the methods from mathematical morphology (described be-
low).These methods can be signiﬁcantly time-consuming,the global knowledge is
needed during computation in some cases,therefore,they are challenging for GPU
Morphological Processing.The implementations of basic algorithms,such as dila-
tion,erosion,opening,and closing,have been already well-studied and described in lit-
erature.Thus,more advanced approaches will be considered,for example,watershed
transform,morphological reconstructions and reconstruction-based operators.These
algorithms are often used for image segmentation,either for pre-processing,or for
segmentation itself.The well-known bottleneck of the watershed is over-segmentation,
thus,the pre-processing is an essential part of the segmentation process.
I plan to ﬁnish my Ph.D.studies by dissertation defence in the termautumn 2012 (after
8 terms of doctoral studies).The expected time schedule of my work comes as follows:
Spring 2011—an Erasmus internship in foreign research centre.The methods
frommathematical morphology will be studied.The particular algorithms will be
chosen for my research topic,to be applied for segmentation of biomedical data.
Subsequently,the research on parallelization of the algorithms will be conducted.
Autumn 2011—I will proceed with previous work on morphological process-
ing and image segmentation.Also,deconvolution methods will be examined.
Since large ﬁlter kernels (PSFs) are typical for the light microscopy,deconvolu-
tion based on the fast Fourier transformwill be studied,in particular.
Spring 2012 —The deconvolution algorithms will be implemented on GPU and
applied on biomedical data.Both real data fromoptical microscopy and artiﬁcial
data fromthe Cytopacq simulator [SKS09] will be taken into account.
Autumn 2012 —Dissertation write-up and defense.
During the Ph.D.studies,all results of the previous work should be published on
an international conference or in a highly rated journal.
[AE10] AccelerEyes - MATLAB GPU Computing.http://www.accelereyes.
[AMD10] AMD FireStream
9270 GPU Compute Accelerator.http://www.amd.
[ATI10a] ATI Stream SDK OpenCL Programming Guide.http://developer.
[ATI10b] ATI Stream Software Development Kit (SDK) v2.1.http://developer.
C10] Jiˇrí Barnat,Petr Bauch,Luboš Brim,and Milan
Strongly Connected Components in Parallel on CUDA.Technical Re-
port FIMU-RS-2010-10,Faculty of Informatics,Masaryk University,Brno,
04] Ian Buck,TimFoley,Daniel Horn,Jeremy Sugerman,Kayvon Fatahalian,
Mike Houston,and Pat Hanrahan.Brook for GPUs:Stream Computing
on Graphics Hardware.ACMTRANSACTIONS ON GRAPHICS,23:777–
[BFRR01] Thomas Bräunl,Stefan Feyrer,Wolfgang Rapf,and Michael Reinhardt.
Parallel Image Processing.Springer,2001.
[BK03] Mario Botsch and Leif Kobbelt.High-Quality Point-Based Rendering
on Modern GPUs.In PG ’03:Proceedings of the 11th Paciﬁc Conference
on Computer Graphics and Applications,page 335,Washington,DC,USA,
2003.IEEE Computer Society.
01] Michael D.Beynon,Tahsin Kurc,Umit Catalyurek,Chialin Chang,Alan
Sussman,and Joel Saltz.Distributed processing of very large datasets
with DataCutter.Parallel Comput.,27(11):1457–1478,2001.
[BM91] S.Beucher and Centre De Morphologie Mathmatique.The Watershed
Transformation Applied To Image Segmentation.In Scanning Microscopy
[BR96] Stephen Brown and Jonathan Rose.Architecture of FPGAs and CPLDs:
A Tutorial.IEEE Design and Test of Computers,13:42–57,1996.
[BRHM96] A.F.Boden,D.C.Redding,R.J.Hanisch,and J.Mo.Massively par-
allel spatially variant maximum-likelihood restoration of Hubble Space
[BTVG06] Herbert Bay,Tinne Tuytelaars,and Luc Van Gool.SURF:Speeded Up
Robust Features.In Aleš Leonardis,Horst Bischof,and Axel Pinz,ed-
itors,Computer Vision – ECCV 2006,volume 3951 of Lecture Notes in
Computer Science,pages 404–417.Springer Berlin/Heidelberg,2006.
[BZZ09] S.Bonettini,R.Zanella,and L.Zanni.A scaled gradient projection
method for constrained image deblurring.Inverse Problems,25(1):015002,
[Cas96] Kenneth R.Castleman.Digital Image Processing.Prentice Hall,2nd edi-
08] Shuai Che,Michael Boyer,Jiayuan Meng,David Tarjan,Jeremy W.Sheaf-
fer,and Kevin Skadron.A performance study of general-purpose appli-
cations on graphics processors using CUDA.J.Parallel Distrib.Comput.,
[CC74] Christoph Cremer and Thomas Cremer.Considerations on a laser-
scanning-microscope with high resolution and depth of ﬁeld.1974.
08] Daniel Casta´no-Díez,Dominik Moser,Andreas Schoenegger,Sabine
Pruggnaller,and Achilleas S.Frangakis.Performance evaluation of im-
age processing algorithms on the GPU.Journal of Structural Biology,
164(1):153 – 160,2008.
[CLW04] Joshua E.Cates,Aaron E.Lefohn,and Ross T.Whitaker.GIST:an in-
teractive,GPU-based level set segmentation tool for 3D medical images.
Medical Image Analysis,8(3):217 – 231,2004.Medical Image Computing
and Computer-Assisted Intervention - MICCAI 2003.
[Coe10] David Coeurjolly.Volumetric analysis of digital objects using dis-
tance transformation:performance issues and applications,August 2010.
Workshop on Applications of Digital Geometry and Mathematical Mor-
[CTMT10] Thanh-Tung Cao,Ke Tang,Anis Mohamed,and Tiow-Seng Tan.Parallel
Banding Algorithm to compute exact distance transform with the GPU.
In I3D ’10:Proceedings of the 2010 ACM SIGGRAPH symposium on Inter-
active 3D Graphics and Games,pages 83–90,New York,NY,USA,2010.
CUBLAS Library 3.1.http://developer.download.nvidia.
CUFFT Library 3.1.http://developer.download.nvidia.
Programming Guide 3.1.http://developer.download.
SDK Code Samples 3.1.http://developer.nvidia.com/
[CUD10e] NVIDIA GPU Computing Developer Home Page.http://developer.
[CV01] Tony F.Chan and Luminita A.Vese.Alevel set algorithmfor minimizing
the Mumford-Shah functional in image processing.In VLSM’01:Proceed-
ings of the IEEE Workshop on Variational and Level Set Methods (VLSM’01),
page 161,Washington,DC,USA,2001.IEEE Computer Society.
[CVG08] N.Cornelis and L.Van Gool.Fast scale invariant feature detection and
matching on programmable graphics hardware.In Computer Vision and
Pattern Recognition Workshops,2008.CVPRW ’08.IEEE Computer Society
Conference on,pages 1 –8,23-28 2008.
[Dat01] DataCutter Project.http://www.cs.umd.edu/projects/hpsl/
[DDdV10] Bernardt Duvenhage,J.P.Delport,and Jason de Villiers.Implementa-
tion of the Lucas-Kanade image registration algorithm on a GPU for 3D
computational platform stabilisation.In AFRIGRAPH ’10:Proceedings of
the 7th International Conference on Computer Graphics,Virtual Reality,Visu-
alisation and Interaction in Africa,pages 83–90,New York,NY,USA,2010.
[DMAC03] F.Drago,K.Myszkowski,T.Annen,and N.Chiba.Adaptive Logarithmic
Mapping For Displaying High Contrast Scenes.Computer Graphics Forum,
[DPRS01] Udo Diewald,Tobias Preusser,Martin Rumpf,and Robert Strzodka.Dif-
fusion Models And Their Accelerated Solution In Image And Surface
[DS04] Carsten Dachsbacher and Marc Stamminger.Rendering Procedural Ter-
rain by Geometry Image Warping.In Rendering Techniques,pages 103–
[Duc76] J.Duchon.Interpolation des fonctions de deux variables suivant le
principe de la ﬂexion des plaques minces.R.A.I.R.O.Analyse num
[DVW09] L.Domanski,P.Vallotton,and D.Wang.Two and Three-Dimensional
Image Deconvolution on Graphics Hardware.In Proceedings of the 18th
World IMACS/MODSIMCongress,pages 1010–1016,Cairns,Australia,13-
17 July 2009.
[ESA05] O.C.Eidheim,J.Skjermo,and L.Aurdal.Real-time analysis of ultra-
sound images using GPU.International Congress Series,1281:284 – 289,
2005.CARS 2005:Computer Assisted Radiology and Surgery.
05] Quanfu Fan,Alon Efrat,Vladlen Koltun,Shankar Krishnan,and Suresh
Venkatasubramanian.Hardware-assisted natural neighbor interpola-
tion.In In:Proc.7th Workshop on Algorithm Engineering and Experiments
[FJ09] Matteo Frigo and Steven G.Johnson.FFTW 3.2.2.http://www.fftw.
[FM05] James Fung and Steve Mann.OpenVIDIA:parallel GPU computer vi-
sion.In MULTIMEDIA ’05:Proceedings of the 13th annual ACM interna-
tional conference on Multimedia,pages 849–852,NewYork,NY,USA,2005.
[FR] James Fung and John Roberts.OpenVIDIA:Parallel GPU Computer Vi-
[Fra04] François Goudail and Philippe Réfrégier.Statistical image processing tech-
niques for noisy images:an application-oriented approach.Springer,2004.
[Fun05] J.Fung.Computer Vision on the GPU,chapter 40,pages 649–665.Addison-
C06] O.Fialka and M.
Cadík.FFT and Convolution Performance in Image
Filtering on GPU.In Information Visualization,2006.IV 2006.Tenth Inter-
national Conference on,pages 609 –614,5-7 2006.
[GA09] Leo Grady and Christopher V.Alvino.The piecewise smooth Mumford-
Shah functional on an arbitrary graph.Trans.Img.Proc.,18(11):2547–2561,
08] Naga K.Govindaraju,Brandon Lloyd,Yuri Dotsenko,Burton Smith,
and John Manferdelli.High performance discrete Fourier transforms
on graphics processors.In SC ’08:Proceedings of the 2008 ACM/IEEE con-
ference on Supercomputing,pages 1–12,Piscataway,NJ,USA,2008.IEEE
[GLG05] Minglun Gong,Aaron Langille,and Mingwei Gong.Image Analysis
and Recognition,volume 3656/2005,chapter Real-Time Image Process-
ing Using Graphics Hardware:A Performance Study,pages 1217–1225.
08] Michael Garland,Scott Le Grand,John Nickolls,Joshua Anderson,Jim
Hardwick,Scott Morton,Everett Phillips,Yao Zhang,and Vasily Volkov.
Parallel Computing Experiences with CUDA.IEEE Micro,28(4):13–27,
09] Markus Gipp,Guillermo Marcus,Nathalie Harder,Apichat Suratanee,
Karl Rohr,Rainer König,and Reinhard Männer.Haralick’s Texture Fea-
tures Computed by GPUs for Biological Applications.2009.
[GPG] General-Purpose computation on Graphics Processing Units.http://
[Gri10] Future of computing:GPGPU?http://gridtalk-project.blogspot.
[GS05] Antonio Garcia and Han-Wei Shen.GPU-based 3D wavelet recon-
struction with tileboarding.The Visual Computer,21:755–763,2005.
[GW02] Rafael C.Gonzales and Richard E.Woods.Digital Image Processing.
[GWWH05] Nolan Goodnight,Rui Wang,Cliff Woolley,and Greg Humphreys.In-
teractive time-dependent tone mapping using programmable graphics
hardware.In SIGGRAPH ’05:ACM SIGGRAPH 2005 Courses,page 180,
[Hal08] Haldar,J.P.and Hernando,D.and Song,S.-K.and Liang,Z.-P.Anatom-
ically constrained reconstruction from noisy data.Magnetic Resonance in
[HB08] S.Harding and W.Banzhaf.Genetic programming on GPUs for image
08] Timothy D.R.Hartley,Umit Catalyurek,Antonio Ruiz,Francisco Igual,
Rafael Mayo,and Manuel Ujaldon.Biomedical image analysis on a co-
operative cluster of GPUs and multicores.In ICS ’08:Proceedings of the
22nd annual international conference on Supercomputing,pages 15–25,New
[HE00] Matthias Hopf and Thomas Ertl.Hardware Accelerated Wavelet Trans-
formations.In In Proceedings of EG/IEEE TCVGSymposiumon Visualization
VisSym ’00,pages 93–103,2000.
[HLSB04] Markus Hadwiger,Caroline Langer,Henning Scharsach,and Katja Büh-
ler.State of the Art Report 2004 on GPU-Based Segmentation.Technical
report,VRVis Research Center,Vienna,2004.
[HN07] Pawan Harish and P.J.Narayanan.Accelerating large graph algorithms
on the GPU using CUDA.In HiPC’07:Proceedings of the 14th international
conference on High performance computing,pages 197–208,Berlin,Heidel-
[How10] Lee Howes.OpenCL:Parallel Computing for CPUs and
[HS92] Stefan Hell and Ernst H.K.Stelzer.Properties of a 4Pi confocal ﬂuores-
[Jäh05] Bernd Jähne.Digital Image Processing.Springer,6th edition,2005.
[JKDP09] Byunghyun Jang,David Kaeli,Synho Do,and Homer Pien.Multi GPU
implementation of iterative tomographic reconstruction algorithms.In
ISBI’09:Proceedings of the Sixth IEEE international conference on Symposium
on Biomedical Imaging,pages 185–188,Piscataway,NJ,USA,2009.IEEE
[Joh07] John E.Stone and James C.Phillips and Peter L.Freddolino and David
J.Hardy and Leonardo G.Trabuco and Klaus Schulten.Accelerating
molecular modeling applications with graphics processors.Journal of
10] Jingfei Kong,Martin Dimitrov,Yi Yang,Janaka Liyanage,Lin Cao,Jacob
Staples,Mike Mantor,and Huiyang Zhou.Accelerating MATLAB Image
Processing Toolbox functions on GPUs.In GPGPU ’10:Proceedings of the
3rd Workshop on General-Purpose Computation on Graphics Processing Units,
pages 75–85,New York,NY,USA,2010.ACM.
[KEGM10] Dimitri Komatitsch,Gordon Erlebacher,Dominik Göddeke,and David
Michéa.High-order ﬁnite-element seismic wave propagation modeling
with MPI on a large GPU cluster.Journal of Computational Physics,In
[KES07] Martin Kraus,Mike Eissele,and Magnus Strengert.GPU-Based Edge-
Directed Image Interpolation.In Bjarne Ersbøll and Kim Pedersen,ed-
itors,Image Analysis,volume 4522 of Lecture Notes in Computer Science,
pages 532–541.Springer Berlin/Heidelberg,2007.10.1007/978-3-540-
[Kim05] Kimmel,Ron and Sochen,Nir and Weickert,Joachim,editor.Scale Space
and PDE Methods in Computer Vision.Springer,2005.Proceedings of the
5th International Conference on Scale Space and PDE Methods in Com-
puter Vision,Scale-Space 2005.
[KK04] Francis Kelly and Anil Kokaram.Fast Image Interpolation for Motion
Estimation using.In in Real Time Imaging VIII,SPIE,pages 184–194,2004.
[KmH10] David Kirk and Wen mei Hwu.Programming Massively Parallel Processors:
A Hands-on Approach.Morgan Kaufmann,2010.
[KP08] C.Kauffmann and N.Piche.Cellular automaton for ultra-fast watershed
transform on GPU.In Pattern Recognition,2008.ICPR 2008.19th Interna-
tional Conference on,pages 1 –4,8-11 2008.
[KW03] Jens Krüger and Rüdiger Westermann.Linear algebra operators for GPU
implementation of numerical algorithms.In SIGGRAPH ’03:ACM SIG-
GRAPH 2003 Papers,pages 908–916,New York,NY,USA,2003.ACM.
[LCW03] Aaron Lefohn,Joshua Cates,and Ross Whitaker.Interactive,GPU-Based
Level Sets for 3D Segmentation.pages 564–572.2003.
[LD08] Yuancheng Luo and R.Duraiswami.Canny edge detection on NVIDIA
CUDA.In Computer Vision and Pattern Recognition Workshops,2008.
CVPRW’08.IEEE Computer Society Conference on,pages 1 –8,23-28 2008.
[LDS05] David Levin,Damini Dey,and Piotr Slomka.Efﬁcient 3D nonlinear
warping of computed tomography:two high-performance implemen-
tations using OpenGL.volume 5744,pages 34–42.SPIE,2005.
[LDZZ09] Baofeng Li,Yong Dou,Haifang Zhou,and Xingming Zhou.FPGA accel-
erator for wavelet-based automated global image registration.EURASIP
10] Victor W.Lee,Changkyu Kim,Jatin Chhugani,Michael Deisher,Dae-
hyun Kim,Anthony D.Nguyen,Nadathur Satish,Mikhail Smelyanskiy,
Srinivas Chennupaty,Per Hammarlund,Ronak Singhal,and Pradeep
Dubey.Debunking the 100X GPU vs.CPU myth:an evaluation of
throughput computing on CPU and GPU.SIGARCH Comput.Archit.
[Luc74] L.B.Lucy.An iterative technique for the rectiﬁcation of observed distri-
butions.The Astronomical Journal,79:745–+,June 1974.
[Mar96] Petros A.Maragos.Mathematical Morphology and Its Applications to Image
and Signal Processing.Kluwer Academic Publishers,Norwell,MA,USA,
[MBS09] Maryam Moazeni,Alex Bui,and Majid Sarrafzadeh.Accelerating total
variation regularization for matrix-valued images on GPUs.In CF ’09:
Proceedings of the 6th ACM conference on Computing frontiers,pages 137–
04] Michael McCool,Stefanus Du Toit,Tiberiu Popa,Bryan Chan,and Kevin
Moule.Shader algebra.In SIGGRAPH’04:ACMSIGGRAPH2004 Papers,
pages 787–795,New York,NY,USA,2004.ACM.
[MOOXS08] P.Muyan-Ozcelik,J.D.Owens,Junyi Xia,and S.S.Samant.Fast De-
formable Registration on the GPU:ACUDAImplementation of Demons.
In Computational Sciences and Its Applications,2008.ICCSA ’08.Interna-
tional Conference on,pages 223 –233,june 2008.
[NOEM08] Akira Nukada,Yasuhiko Ogata,Toshio Endo,and Satoshi Matsuoka.
Bandwidth intensive 3-D FFT kernel for GPUs using CUDA.In SC ’08:
Proceedings of the 2008 ACM/IEEE conference on Supercomputing,pages 1–
[Nyl07] Nyland,Lars and Harris,Mark and Prins,Jan.Fast N-Body Simulation
with CUDA.In Hubert Nguyen,editor,GPUGems 3,chapter 31.Addison
Wesley Professional,August 2007.
05] John D.Owens,David Luebke,Naga Govindaraju,Mark Harris,Jens
Krüger,Aaron E.Lefohn,and Timothy J.Purcell.A Survey of General-
Purpose Computation on Graphics Hardware.pages 21–51,August 2005.
07] John D.Owens,David Luebke,Naga Govindaraju,Mark Harris,Jens
Krüger,Aaron Lefohn,and Timothy J.Purcell.A survey of general-
purpose computation on graphics hardware.Computer Graphics Forum,
[Ope10a] OpenCL.http://www.khronos.org/opencl/,Jun 2010.
[Ope10b] OpenCL 1.1 Reference Pages.http://www.khronos.org/registry/cl/
[Ope10c] The OpenCL Speciﬁcation 1.1.http://developer.amd.com/gpu/
[PDHW10] Brad Peterson,Manasi Datar,Mary Hall,and Ross Whitaker.GPU Ac-
celerated Particle Systemfor Triangulated Surface Meshes.Jul 2010.
09] Alexandros Papakonstantinou,Karthik Gururaj,John A.Stratton,Dem-
ing Chen,Jason Cong,and Wen-Mei W.Hwu.FCUDA:Enabling efﬁcient
compilation of CUDAkernels onto FPGAs.Application Speciﬁc Processors,
[PGX08] Lei Pan,Lixu Gu,and Jianrong Xu.Implementation of medical im-
age segmentation in CUDA.In Information Technology and Applications
in Biomedicine,2008.ITAB 2008.International Conference on,pages 82 –85,
[Pod07] Victor Podlozhnyuk.Image Convolution with CUDA.http:
[Pra01] WilliamK.Pratt.Digital Image Processing.John Wiley &Sons,3rd edition,
10] In Kyu Park,Nitin Singhal,Man Hee Lee,Sungdae Cho,and Chris Kim.
Design and Performance Evaluation of Image Processing Algorithms on
GPUs.IEEE Transactions on Parallel and Distributed Systems,99(PrePrints),
[QFTI09] Cory W.Quammen,David Feng,and Russell M.Taylor II.Performance
of 3D Deconvolution Algorithms on Multi-Core and Many-Core Archi-
tectures.Technical report,University of North Carolina at Chapel Hill,
Department of Computer Science,2009.
[RIC72] WILLIAM HADLEY RICHARDSON.Bayesian-Based Iterative Method
of Image Restoration.J.Opt.Soc.Am.,62(1):55–59,1972.
[RM07] Matthias Raspe and Stefan Müller.Using a GPU-based Framework for
Interactive Tone Mapping of Medical Volume Data.In SIGRAD 2007:
Computer Graphics in Healthcare,pages 3–10.Linköping University Elec-
tronic Press,Linköpings universitet,2007.
[ROF92] Leonid I.Rudin,Stanley Osher,and Emad Fatemi.Nonlinear total vari-
ation based noise removal algorithms.Phys.D,60(1-4):259–268,1992.
[RPZ02] Liu Ren,Hanspeter Pﬁster,and Matthias Zwicker.Object Space EWA
Surface Splatting:A Hardware Accelerated Approach to High Quality
Point Rendering.In Computer Graphics Forum,pages 461–470,2002.
08] Shane Ryoo,Christopher I.Rodrigues,Sara S.Baghsorkhi,SamS.Stone,
David B.Kirk,and Wen-mei W.Hwu.Optimization principles and ap-
plication performance evaluation of a multithreaded GPU using CUDA.
In PPoPP ’08:Proceedings of the 13th ACMSIGPLAN Symposium on Princi-
ples and practice of parallel programming,pages 73–82,NewYork,NY,USA,
[RSSF02] Erik Reinhard,Michael Stark,Peter Shirley,and James Ferwerda.Pho-
tographic tone reproduction for digital images.ACM Trans.Graph.,
[RT06] Guodong Rong and Tiow-Seng Tan.Jump ﬂooding in GPUwith applica-
tions to Voronoi diagramand distance transform.In I3D ’06:Proceedings
of the 2006 symposium on Interactive 3D graphics and games,pages 109–116,
[RUCH09] Antonio Ruiz,Manuel Ujaldon,Lee Cooper,and Kun Huang.Non-rigid
Registration for Large Sets of Microscopic Images on Graphics Proces-
[RUG07] Antonio Ruiz,Manuel Ujaldón,and Nicolás Guil.Using Graphics Hard-
ware for Enhancing Edge and Circle Detection.In Joan Martí,José
Benedí,Ana Mendonça,and Joan Serrat,editors,Pattern Recognition and
Image Analysis,volume 4478 of Lecture Notes in Computer Science,pages
[SA09] Ojaswa Sharma and François Anton.CUDA based Level Set Method
for 3D Reconstruction of Fishes from Large Acoustic Data.In Interna-
tional Conference in Central Europe on Computer Graphics,Visualization and
Computer Vision (WSCG),2009.
[SBW06] Jens Schneider,Tobias Boldte,and Rüdiger Westermann.Real-Time Edit-
ing,Synthesis,and Rendering of Inﬁnite Landscapes on GPUs.In Vision,
Modeling and Visualization 2006,2006.
[SDR03] Robert Strzodka,Marc Droske,and Martin Rumpf.Fast Image Reg-
istration in DX9 Graphics Hardware.Journal of Medical Informatics and
[SDR04] Robert Strzodka,Marc Droske,and Martin Rumpf.Image Registration
by a Regularized Gradient Flow - A Streaming Implementation in DX9
Graphics Hardware.Computing,73(4):373–389,November 2004.
[Ser83] Jean Serra.Image Analysis and Mathematical Morphology.Academic Press,
[SG04] Robert Strzodka and Christoph Garbe.Real-Time Motion Estimation and
Visualization on Graphics Cards.In Proceedings IEEE Visualization 2004,
[SHN03] Anthony Sherbondy,Mike Houston,and Sandy Napel.Fast Volume
Segmentation With Simultaneous Visualization Using Programmable
Graphics Hardware.In VIS ’03:Proceedings of the 14th IEEE Visualiza-
tion 2003 (VIS’03),page 23,Washington,DC,USA,2003.IEEE Computer
08] S.S.Stone,J.P.Haldar,S.C.Tsao,W.m.W.Hwu,B.P.Sutton,and Z.P.
Liang.Accelerating advanced MRI reconstructions on GPUs.J.Parallel
[SKE06] Magnus Strengert,Martin Kraus,and Thomas Ertl.Pyramid Methods in
GPU-Based Image Processing.pages 169–176,2006.
[SKS09] David Svoboda,Michal Kozubek,and Stanislav Stejskal.Generation of
Digital Phantoms of Cell Nuclei and Simulation of Image Formation in
3D Image Cytometry.CYTOMETRY PART A,75A(6):494–509,JUN 2009.
[SKSF07] G C Sharp,N Kandasamy,H Singh,and M Folkert.GPU-based
streaming architectures for fast cone-beam CT image reconstruction
and demons deformable registration.Physics in Medicine and Biology,
[SM06] Jean-Luc Starck and Fionn Murtagh.Astronomical Image and Data Analy-
[SM08] H.Sugano and R.Miyamoto.Parallel implementation of morphological
processing on Cell/BE with OpenCV interface.In Communications,Con-
trol and Signal Processing,2008.ISCCSP 2008.3rd International Symposium
on,pages 578 –583,12-14 2008.
[SN06] P.Sarder and A.Nehorai.Deconvolution methods for 3-D ﬂuorescence
microscopy images.Signal Processing Magazine,IEEE,23(3):32 – 45,may
[SV82] L.A.Shepp and Y.Vardi.MaximumLikelihood Reconstruction for Emis-
sion Tomography.Medical Imaging,IEEE Transactions on,1(2):113–122,
[SZZ10] Thomas Seraﬁni,Riccardo Zanella,and Luca Zanni.Gradient projection
methods for image deblurring and denoising on graphics processors.
Advances in Parallel Computing,19:59–66,2010.
[Tes09] Tesla C1060 Computing Processor Board.http://www.nvidia.com/
[Tes10] Tesla C2050/C2070 GPU Computing Processor.http://www.nvidia.
[TH78] H.Trussell and B.Hunt.Image restoration of space variant blurs by
sectioned methods.In Acoustics,Speech,and Signal Processing,IEEE Inter-
national Conference on ICASSP ’78.,volume 3,pages 196 – 198,apr 1978.
[TK06] Hiroyuki Takizawa and Hiroaki Kobayashi.Hierarchical parallel process-
ing of large scale data clustering on a PCcluster with GPUco-processing.
[Uja09] Manuel Ujaldon.GPU acceleration of Zernike moments for large-scale
images.In IPDPS ’09:Proceedings of the 2009 IEEE International Symposium
on Parallel&Distributed Processing,pages 1–8,Washington,DC,USA,2009.
IEEE Computer Society.
[vDvdWT06] M.van Dortmont,H.van de Wetering,and A.Telea.Skeletonization and
Distance Transforms of 3D Volumes Using Graphics Hardware.In Attila
Kuba,László Nyúl,and Kálmán Palágyi,editors,Discrete Geometry for
Computer Imagery,volume 4245 of Lecture Notes in Computer Science,pages
[Ver98] Peter J.Verveer.Computational and optical methods for improving resolu-
tion and signal quality in ﬂuorescence microscopy.PhD thesis,Technische
Universiteit Te Delft,1998.
[VN08] Vibhav Vineet and P.J.Narayanan.CUDA cuts:Fast graph cuts on the
GPU.Computer Vision and Pattern Recognition Workshop,0:1–8,June 2008.
[Wie64] Norbert Wiener.Extrapolation,Interpolation,and Smoothing of Stationary
Time Series.The MIT Press,1964.
[WLHW07] Tien-Tsin Wong,Chi-Sing Leung,Pheng-Ann Heng,and Jianqing Wang.
Discrete Wavelet Transform on Consumer-Level Graphics Hardware.
Multimedia,IEEE Transactions on,9(3):668 –673,april 2007.
[WN] Piotr Wendykier and James G.Nagy.Parallel Colt:A High Performance
Java Library for Scientiﬁc Computing and Image Processing.ACMTrans-
actions on Mathematical Software,37(3).
[XCZ09] Yongshun Xiao,Zhiqiang Chen,and Li Zhang.Accelerated CT Recon-
struction Using GPU SIMD Parallel Computing with Bilinear Warping
Method.Information Science and Engineering,International Conference on,
[XM05] Fang Xu and K.Mueller.Accelerating popular tomographic reconstruc-
tion algorithms on commodity PC graphics hardware.Nuclear Science,
IEEE Transactions on,52(3):654 – 663,june 2005.
[XM09] Wei Xu and Klaus Mueller.Accelerating regularized iterative CT recon-
struction on commodity graphics hardware (GPU).In ISBI’09:Proceed-
ings of the Sixth IEEE international conference on Symposium on Biomedical
Imaging,pages 1287–1290,Piscataway,NJ,USA,2009.IEEE Press.
[YNAS09] Lee Seng Yeong,Christopher Wing Hong Ngau,Li-Minn Ang,and
Kah Phooi Seng.Efﬁcient processing of a rainfall simulation watershed
on an FPGA-based architecture with fast access to neighbourhood pixels.
EURASIP J.Embedded Syst.,2009:4–4,2009.
[YQ08] Lei Yu and Dawei Qi.Improved multi-scale and structuring element
morphological detection in the log CT image.In Cybernetics and Intelligent
Systems,2008 IEEE Conference on,pages 1248 –1253,21-24 2008.
[Zit03] Barbora Zitová.Image registration methods:a survey.Image and Vision
04] C.Lawrence Zitnick,Sing Bing Kang,Matthew Uyttendaele,Simon
Winder,and Richard Szeliski.High-quality video view interpolation us-
ing a layered representation.ACMTrans.Graph.,23(3):600–608,2004.
09] Jian Zhu,Youquan Liu,Kai Bao,Yuanzhang Chang,and Enhua Wu.
Warping of a spherical representation of image-based models on GPU.In
VRCAI ’09:Proceedings of the 8th International Conference on Virtual Reality
Continuum and its Applications in Industry,pages 89–94,New York,NY,
4 Summary of Study Results
During the ﬁrst three terms of doctoral study I succesfully passed courses focused on
image processing methods,namely Advanced Methods of Digital Image Processing
(FI:PA166),Digital Image Filtering (FI:PA171),Mathematical Morphology (FI:PA173),
and Digital Geometry (FI:PA170).I learned a number of methods from mathematical
morphology,image ﬁltering,edge detection,and image segmentation.
As a part of my study,I attended the Image Acquisition and Analysis in Microscopy
workshop in Prague in May,2009.I also attended the Spring School of Image Processing
in Mariánská in June,2010.
I read papers with results of past researches in the area of GPUacceleration.In par-
ticular,I focused on implementations of image processing algorithms.Also I learned
the CUDAparallel programming model and succesfully passed the GPUProgramming
I implemented several algorithms in CUDA,namely convolution,morphological re-
construction,and a number of reconstruction-based operators (for detailed description
refer to the following paragraphs).At CBIA FI MU,we develop i3dcore and i3dalgo,
cross-platformC++ libraries for 3Dimage processing.These libraries are nowextended
to support GPU-based processing and the image processing algorithms implemented
in CUDA will be soon included in the libraries.
I implemented convolution of large 3D images,based on the fast Fourier transform,on
GPU.The CUDAframework and the CUFFT library were used for the implementation.
With D.Svoboda we proposed a method to decompose large 3D images using the
decimation in frequency (DIF) algorithm,so that we are able to process images that are
too large to be stored in the GPU memory.The paper on the topic was submitted to
EURASIP Journal on Advances in Signal Processing [KS10].
I studied efﬁcient algorithms for computing the morphological reconstruction,de-
scribed in literature.I proposed a GPU-based implementation in CUDA and com-
pared the performance of all approaches.The paper on the topic was submitted to the
MEMICS 2010 conference [K10].
References to the submitted papers
[KS10] Pavel Karas and David Svoboda.Convolution of Large 3D Images on GPU and
its Decomposition.EURASIP Journal on Advances in Signal Processing.Hindawi,
[K10] Pavel Karas.Efﬁcient Computation of Morphological Greyscale Reconstruc-
tion.MEMICS 2010 Annual Doctoral Workshop on Mathematical and Engineering
Methods in Computer Science,submitted.
There is a theory which states that if ever anyone discovers exactly
what the Universe is for and why it is here,it will instantly disappear
and be replaced by something even more bizarre and inexplicable.
There is another theory which states that this has already happened.
Douglas Adams,The Restaurant at the End of the Universe