Image registration on GPU

pumpedlessSoftware and s/w Development

Dec 2, 2013 (3 years and 10 months ago)

123 views

INSTITUT SUPÉRIEUR D’INFORMATIQUE DE MODÉLISATION ET DE LEURS APPLICATIONS
ISIMA - University of Blaise Pascal - CSIRO
Final year project
Computation and Scientific Modeling
Image registration on GPU
Authors:
Jérémy Coatelen
Yu Qin
Supervisors:
Nicholas Dowson
Vincent Barra
Jonathan Caux
March 14,2011
Acknowledgment
Thanks to our tutor,Nicholas Downson,from the
Commonwealth Scientific and Industrial
Research Organisation (CSIRO)
and Vincent Barra and Jonathan Caux,our project tutors at the
Institut Supérieur d’Informatique,de Modélisation et de leurs Applications (ISIMA)
.They helped
us lots during our project.Thanks to our teachers,especially the teachers of the computation and
scientific modeling option.Thank you for your valuable advice and support,encouragement and
guidance.
i
Abstract
Image registration is a fundamental step in many applications involving image analysis.It
consists of optimizing a similarity metric to find a spatial transformation to match two images (in
3D).It has application in medical images to build atlases (registering a population),or to align
a patient to a template to detect pathologies.The main objective of this project is to investigate
different image registration algorithms in the context of developing a cloud computing service that
it will be used in medical images application.
In order to reach this goal,at first we ported in C++ simple block matching algorithms,we
used a local brute force searching for the block in the fixed image that best matches each block
within the moving image.Several different classes of C++ are built for the matching processing
only based on rigid transformation.At first we used the sequential block matching algorithm,how-
ever its complexity is too expensive,so we used the parallel block matching system using OpenCL
library (Graphics Processing Unit programming).
The result is a list of matched coordinates which could be seen as a list of local transformations.
Fo each local transformation we have a score which shows the trust on that transformation.With
these results we can then find the global rigid transformation between both images.
Keywords:Image registration,block matching,optimize,rigid transformation,sequential,
parallel,OpenCl,C++
ii
Glossary
API
Application Programming Interface.
7
BMA
Block Matching Algorithm.
11
,
12
,
27
CPU
central Processing Unit.
7
CSIRO
Commonwealth Scientific and Industrial Research Organisation.
i
,
1
,
2
,
27
CT
Computed Tomography.
7
CUDA
Compute Unified Device Architecture.
7
,
24
DICOM
Digital Imaging and Communication in Medicine.
iii
GPGPU
General-Purpose computing on Graphics Processing Units.
7
GPU
Graphics Processing Unit.
3

5
,
7
,
8
,
21
ISIMA
Institut Supérieur d’Informatique,de Modélisation et de leurs Applications.
i
kernel
An elementary task to execute on each thread of the working group (written using the
C++ language).
17
MRI
Magnetic Resonance Imaging.
1
,
3
NIfTI
Neuroimaging Informatics Technology Initiative.
iii
OpenCL
Open Computing Language.
7
,
8
,
24
,
25
RTE
Rigid Transformation Estimation.
11
,
12
,
27
VPU
Visual Processing Unit.
4
working group
Set of working elements (threads) in the GPU architecture.
18
,
25
iii
List of Symbols
List of Notations:
W
Width of the images
H
Height of the images
D
Depth of the images
I
r
The reference image
I
m
The moving image
(i,j)
Coodinates in I
r
(k,l)
Coodinates in I
m
B
i,j
Block of N ×N pixels in size,centered on (i,j) in I
r
B
￿
r,s
Block of N ×N pixels in size,centered on (r,s) in I
m
C
r,s
i,j
Similarity criterion between B
i,j
and B
￿
r,s
(Ch.
II.3
)
B
i,j
Mean of B
i,j
σ(B
i,j
)
Standard deviation of B
i,j
V
i,j
Neighbourhood of (i,j)
The main parameters of the algorithm:
N
Size of a block of pixels
Ω
Amount of pixels for the search area
Δ
1
Amount of pixels to skip between two points in I
r
Δ
2
Amount of pixels to skip between two points in I
m
δ
Range of pixels to skip from the sides of the images
iv
Contents
Abstract
ii
Glossary
iii
List of Symbols
iv
Introduction
1
I Presentation
2
I.1 CSIRO
...........................................
2
I.2 Context
..........................................
3
I.2.1 Medical imaging
.................................
3
I.2.2 Implementation on GPU
.............................
4
I.3 Tools
............................................
5
I.3.1 General programming
..............................
5
CMake
.......................................
5
Versioning tool:Subversion
...........................
5
Doxygen
......................................
6
CImg
.......................................
6
MedCon
......................................
6
Boost Program Options
.............................
7
I.3.2 GPU languages
..................................
7
CUDA
.......................................
7
OpenCL
......................................
7
II Study
10
II.1 Approach
.........................................
10
II.2 Theory
...........................................
11
II.3 Implementation
......................................
13
II.3.1 The general structure
..............................
13
UML Diagram and file organization
......................
13
v
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
Design pattern Strategy
.............................
14
II.3.2 MedCon integration
...............................
14
II.3.3 Block Matching algorithm
............................
16
Sequential implementation
............................
17
Parallel implementation
.............................
17
IIIResults
20
III.1 Experimental protocol
..................................
20
III.2 Tests
............................................
21
III.2.1 Time complexity
.................................
21
IVDiscussion
24
IV.1 Difficulties
.........................................
24
IV.1.1 Learning the GPU programming
........................
24
IV.1.2 Work on ATI device
...............................
24
IV.1.3 Take into account the GPU device constraints
.................
25
Crash system issue
............................
25
Debug methods
..............................
25
Memory constraints
...........................
25
IV.2 Considered optimizations and add-ons
.........................
26
IV.2.1 Memory access in the kernels
..........................
26
IV.2.2 Median variance implementation
........................
26
Conclusion
27
References
30
A Getting start
31
B Kernels implementation
35
vi
Introduction
Nowadays,the development of medical imaging techniques is well established.Now it is possible
to visualize the internal anatomy of the body by using modern techniques like
Magnetic Resonance
Imaging (MRI)
.The problem is that we are still looking for good image processing algorithms to
make better diagnosis of the diseases.
In this context,we worked with Nicholas Dowson of the
CSIRO
to implement the first part of
image registration algorithm which is often used in medical applications.At first,this algorithm
is to get information of similarity between an image taken from the patient and a reference image,
and then deduce the optimal transformation between two images.How to optimize this measure
of similarity?
Our goal is to offer an interface in C++ language allowing to get the similarity measures and
the results of the matching algorithm,fundamental for image registration algorithm.The interface
presents different ways to generate the results with the implementation of a sequential version and
a parallel version of the algorithm using graphics card programming.
At first,we are going to present the context of the project.Then we will see the study on the
implementation that we made.Then we will observe the results and at last,we will discuss about
the accomplished work.
1
Chapter I
Presentation
Contents
I.1 CSIRO
.....................................
2
I.2 Context
....................................
3
I.2.1 Medical imaging
...............................
3
I.2.2 Implementation on GPU
...........................
4
I.3 Tools
......................................
5
I.3.1 General programming
............................
5
I.3.2 GPU languages
................................
7
I.1 CSIRO
CSIRO
[
7
] is an independent statutory authority constituted and operating under the provisions
of the Science and Industry Research Act (1949).Its main role is to carry out scientific research of
benefit to Australia and to facilitate the application or utilisation of the results of that research.
The
CSIRO
can trace its history back to 1916,when it began work as the Advisory Council
of Science and Industry.It has operated under its current name since 1949.
CSIRO
carries out
research in primary,secondary and tertiary industries as well as other areas such as environment,
conservation and human nutrition.
2
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
CSIROexpertise is organized into 13 research areas such as Astronomy and Space Science,Earth
Science and Resource engineering,Ecosystem Sciences.And we are working with Mr.Nicholas
Dowson,from the Imaging and Computer Vision Section.
I.2 Context
In order to understand what is at stake in this project,let us look the current context from a
medical imaging point of view and then the parallel implementation on
Graphics Processing Unit
(GPU)
.
I.2.1 Medical imaging
The medical imaging is a medical science domain which gives doctors means of diagnotic due
to the images of the patient.There are two kind of medical imaging:anatomical imaging and
functional imaging.
The anatomical imaging allows to view the internal structures of the body,we can examine for
example the brain cavities sizes,the bones shape,the internal tissues quality,etc.The
MRI
and
the Scanner can get such images.
(a) Embryo
(b) Brain
Figure I.1:Anatomical imaging examples using MRI
The functional imaging gives information on the organs function,we can see thus unusual be-
haviours,wrong circulation of fluids and the location of the cancers that fix the markers.The
nuclear imaging can obtain such images.
3
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
(a) diffusion of water molecules
(b) attachment areas of cancer,nu-
clear imaging
Figure I.2:Functional imaging examples
Nowadays,the images are 2D or 3D images and the measuring devices are performing.
In this project,we work on the neuronal degeneration,so on the anatomical images of the brain
but the algorithm is not only made for that kind of images.
I.2.2 Implementation on GPU
During the realization of our project,we had to implement the processing
GPU
,it is also called
Visual Processing Unit (VPU)
,is a specialized microprocessor that offloads and accelerates graph-
ics rendering from the central processor.
GPU
has a very huge capacity in parallel calculus;its performance is evolving progressively in
nowadays [
9
].
Figure I.3:NVidia GPU
Modern
GPUs
use most of their transistors to do calculations related to 3D computer graph-
4
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
ics.The developments in GPUs include support for programmable shaders which can manipulate
vertices and textures with many of the same operations supported by
GPUs
,oversampling and
interpolation techniques to reduce aliasing,and very high-precision color spaces.Because most of
these computations involve matrix and vector operations,engineers and scientists have increasingly
studied the use of
GPUs
for non-graphical calculations.In addition to the 3D hardware,today’s
GPUs
include basic 2D acceleration and frame buffer capabilities.
Medical imaging is one of the earliest applications to take advantage of
GPU
computing to get
acceleration.
I.3 Tools
Some programming languages or libraries are necessary in order to program the algorithm.
I.3.1 General programming
The main project language is the C++ language.This programming language is efficient and
allows to manage classes according to the general concepts of polymorphism.the procedural pro-
gramming,agent-oriented programming and generic programming are very usefull for this project
as we will see for the Strategy structure (Ch.
II.3.1
).
CMake
CMake [
2
] is a cross-platform and open-source build system.It is a family of tools designed to
build,test and package software.CMake is used to control the software compilation process using
simple platform and compiler independent configuration files.CMake generates native makefiles
and workspaces that can be used in the compiler environment of your choice.
In our case,it makes the Makefile with all the dependencies with the different libraries.
Versioning tool:Subversion
Subversion [
5
] is a free/open source version control system.Subversion manages files and di-
rectories,and the changes made to them,over time.This allows you to recover older versions of
our data or examine the history of how your data changed.
Subversion can operate across networks,which allows it to be used by people on different com-
puters.At some level,the ability for various people to modify and manage the same set of data
from their respective locations fosters collaboration.Progress can occur more quickly without a
single conduit through which all modifications must occur.And because the work is versioned,
5
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
there is not fear about the changes,it is possible to work on an older revision.
Some version control systems are also software configuration management systems.These sys-
tems are specifically tailored to manage trees of source code and have many features that are
specific to software development.Subversion,however,is not one of these systems.It is a general
system that can be used to manage any collection of files.Those files might be source code or
anything from grocery shopping lists to digital video mixdowns and beyond.
In our online repository (appendix
A
),we can find 3 main folders:trunk,tags and branch.
The first one is used for the general development,the second one contains all the release versions
of the software and finally the branch folder that is supposed to contain unstable versions has not
been used.
Doxygen
Doxygen [
4
] is a documentation generator for the programming languages C,C++,C#,For-
tran,Java,Objective-C,PHP,Python,IDL VHDL,etc.At the most of time,it runs on Unix-like
systems,but it also works on Windows.the Doxygen code was written by Dimitri van Heesch.
Doxygen is a tool which is able to write software reference documentation.The documentation
is written within code,and is thus relatively easy to keep up to date.Doxygen can cross reference
documentation and code,so that the reader of a document can easily refer to the actual code.
CImg
The Cimg library [
1
] is also used in our implementation.it is an open source,C++ toolkit
for image processing.It is easy to use,efficient and it is intended to be a very pleasant toolbox
to design image processing algorithms in C++.Due to its generic conception,it covers a wide
range of image processing applications.Words that summarize:usefullness,genericity,portability,
simplicity,extensibility and freedom.
We use this library to store and display images.Our algorithms work on the datas buffer of
the images,we do not use the pre-built algorithms of this library.
MedCon
Medcon [
10
] is an open source toolkit for medical image conversion,it is released under the
GNU (L)GPL license,it has huge application in image processing.The program allows to read
unsupported files without compression,to print pixel values or to extract/reorder/reslice study
images.It is possible to export images as PNG or annimated GIF for desktop applications.Its
6
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
library can be used as a framework.
This library enables us to read and convert the medical image formats as Nifti or Dicom for
our CImg Image buffer storage.
Boost Program Options
With the Boost Program Options library we integrated command line options as it is detailed
in the appendix
A
.With this,we set the input and output files and the parameters of the block
matching algorithm.It is very important when we want to automate all the tests,with the program
which is like a black box.
I.3.2 GPU languages
CUDA
Compute Unified Device Architecture (CUDA)
[
3
] is NVIDIA’s parallel computing architecture.
It enables dramatic increases in computing performance by harnessing the power of the
GPU
.With
millions of CUDA-enabled
GPUs
sold to date,software developers,scientists and researchers are
finding broad-ranging uses for
CUDA
,including image and video processing,computational biology
and chemistry,fluid dynamics simulation,
Computed Tomography (CT)
image reconstruction,
seismic analysis,ray tracing,and much more.Computing is evolving from “central processing“ on
the CPU to ”co-processing“ on the
central Processing Unit (CPU)
and
GPU
.To enable this new
computing paradigm,NVIDIA invented the CUDA parallel computing architecture that is now
shipping in GeForce,ION,Quadro,and Tesla GPUs,representing a significant installed base for
application developers.
OpenCL
Open Computing Language (OpenCL)
[
6
] is a framework which enables to write programs on
different kind of processors as
CPUs
and
GPUs
.It includes a laguage based on the C language
for writing kernels (functions which are executed on the devices).The user is able to define and
control the platforms through the
Application Programming Interface (API)
.
OpenCL
provides
parallel computing using task-based and data-based parallelism.The system is cross-platform and
cross-device,it could run on both AMD/ATI and NVidia graphic cards and offers an effective
and portable
General-Purpose computing on Graphics Processing Units (GPGPU)
programming
environment.
The computer used for the tests is based on a ATI card,we must use OpenCL in our case
but even if,with
CUDA
,
OpenCL
seems to be the slowest,we know that our code can run on all
devices.This was a priority to have a program which works at first and then optimize.A
CUDA
7
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
version of our project could be a nice optimization for the NVidia based computers.
Now we need to define the internal structure used by
OpenCL
.The compute device,which
will be here a
GPU
device,has a global memory.The global memory is the slowest memory of the
GPU
but it is available from anywhere.This platform is divided into groups.Each group is called
a work-group (or working group) and it has its own local memory.A work-group is composed of
work-items (or working items or threads) and each work-itemhas its own private memory (Fig.
I.4
).
Figure I.4:Internal structure used by OpenCL
From here,we need to think about how to implement parallel version of our algorithm.Let’s
see a very simple example of how it is done in
OpenCL
on the figure
I.5
.We can see that the way
we write the function is different in each case.The parallel implementation is defined into a kernel
which is compiled and executed in each thread.This extract of code will be read n times (one time
per elementary task/working item).The variable i
d
contains the rank of the item (the index) in
the vectors of n items.
8
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
The __global keyword is used to declare a global data in the architecture we defined and
__local is used to declare a local data (only available from the working group).The __kernel
keyword defines a kernel definition.In our project we have defined two kernels as you can see
section
II.3.3
.
Figure I.5:Sequential and parallel version of the same function
9
Chapter II
Study
Contents
II.1 Approach
...................................
10
II.2 Theory
.....................................
11
II.3 Implementation
...............................
13
II.3.1 The general structure
............................
13
II.3.2 MedCon integration
.............................
14
II.3.3 Block Matching algorithm
..........................
16
II.1 Approach
When someone has got pathology,it is often necessary to capture images to visualize anatom-
ical and functional condition of his body.This need is even more justified when the pathology
is located in the skull because it is very dangerous to open the skull to make the diagnosis.So,
we realize the importance of having coherent and precise images,with high quality for medical
observation.
When patient with an Alzheimer’s disease
1
is treating,it is compulsory to follow his pathology
in order to visualize the evolution of neuronal areas.During this phase,doctors need to compare
3D images of the brain taken at different moments,and that is when the image registration will
be crucial.
Indeed,we have to adjust the superposition as well as possible in order to compare two images.
The interest is to have a qualitative and quantitative perspective on the evolution of the brain
1
Alzheimer’s disease:Aprogressive degenerative disease that alters the brain,causing impaired memory,thinking
and behavior.
10
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
cavities.Then possible to observe these conditions according to different captures of neuronal
degeneration.
II.2 Theory
The image registration algorithm introduces two algorithms:the
Block Matching Algorithm
(BMA)
and the
Rigid Transformation Estimation (RTE)
.
The role of the
BMA
is to give a list of matches with their associated similarity criterion score.
The
RTE
allows to find the tranformation between both images thanks to the list of vectors given
by the
BMA
.
We have to define some hypothesis to simplify the problem:we suppose that we need to find
a rigid transformation T between both images.
A rigid transformation represents the linear and/or angular displacement of a rigid body;it
can be formally defined as:
T:V ￿−→R.V +t (II.1)
subject to:R
T
= R
−1
and det(R) = 1.
Here,V is a vector,T(V ) is the transformed vector,R is a rotation matrix (orthogonal matrix
with determinant 1),and t a translation vector.Notice that the second constraint excludes re-
flections,as reflections are represented by orthogonal matrices with determinant −1.So,the rigid
transformation has 6 degrees of liberty:rotation + translation.And the similarity transformation
is not included in the rigid transformation.
The
BMA
is a way of locating matching blocks in a sequence of images (here the size of the
sequence is only 2).
This algorithm works by performing a local brute force search for the block B
i,j
in the fixed
image that best matches each block B
￿
r,s
within the moving image.For each set of blocks B
i,j
and
B
￿
r,s
we compute the similarity criterion C
r,s
i,j
and the best score gives the match.
We define the similarity criterion C
r,s
i,j
as below:
C
r,s
i,j
=
r+
N
2
￿
a=r−
N
2
s+
N
2
￿
b=s−
N
2
B
i,j
×
B
￿
a,b
σ(B
i,j
) ×σ(B
￿
a,b
)
(II.2)
11
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
So,the
BMA
gives a list of matches which represent a list of vectors with their associated score.
The k
th
match gives (i,j)
k
and (r,s)
k
vectors and the score (C
r,s
i,j
)
k
.
The
RTE
estimates the rigid transformation with the previous vectors.A standard approach
to solve such a problem is to perform a least squares estimation on the matched points:
(
ˆ
R,
ˆ
t) = Argmin
(R,t)
￿
k
￿r
k
￿
2
(II.3)
where r
k
= (i,j)
k
−R∗ (r,s)
k
−t are the residual errors and ￿.￿ the Euclidean norm.
We do not have implemented the
RTE
for this project,we deplore the lack of time.
Let’s consider the 2 simple images at the figure
III.1
that we divide by blocks.
(a) reference image
(b) moving image
Figure II.1:Input images.
Our objective is to implement a method which is able to “match” the corresponding blocks.
(a) reference image
(b) moving image
(c) matches overview
Figure II.2:Step of the block matching algorithm.
In the figure
II.2
we can see that the purple mached points do not define the same block coor-
dinates whereas the red matched points define two blocks with the same coordinates.The vectors
we got are represented with red arrows,there is an horizontal vector for the purple match and the
12
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
vector for the red match is null.We have to apply this principle to all the blocks in the images.
II.3 Implementation
II.3.1 The general structure
UML Diagram and file organization
The global architecture of the program is represented on the figure
II.3
.
Figure II.3:UML Diagram (using Umbrello)
When you download the project files,you can find severals directories:

bin:executable file location

data:images sources

doc:documention

src:C++ source files (and the associated headers)

utils:Some usefull bash scripts about the project
The main.cpp is the main file of the project.It calls the BlockMatcherManager which runs the
process and manage the command line options.The full list of options is available at the end of
the appendix
A
.
The strategies for the blockmatching algorithm are defined in both classes BlockMatcherSe-
quential and BlockMatcherParallel which implement the abstract class BlockMatcher (Ch.
II.3.3
).
As explained in the previous section,the input of the program is a set of two images (2D or
3D) and a set of parameters for the matching algorithm.The output files are some images which
13
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
represent the matches and the scores and a report file which contains the list of matches and other
usefull information.
Design pattern Strategy
A design pattern is a general reusable solution to a commonly occurring problem in software
design,in other words,it is a formal template which is made to solve particular programming
problems.It is thought to be effective.
A certain number design patterns already exists,on of whom is the design pattern Strategy.
This is made exactly for our project kind of issue.We have here two ways to write the algorithm
corresponding to the derived classes defined in the previous section.The BlockMatcherSequential
class implements the sequential version of the blockmatching algorithm and the BlockMatcherPar-
allel class implements the parallel version.
The BlockMatcherManager has as an attribute a pointer (BlockMatcher *) defined on a strat-
egy.Then its methods can be called without thinking about the strategy nature (sequential or
parallel).This define the interface which is finally the same no matter how is initialized the man-
ager.The manager is able to keep time for the matching step.
We needed to validate this interface to be sure that the result given by a strategy is the same
as the result given by the other one.This was done using one of the usefull bash script we made
which gives the graph of the distance between the results of both methods.
II.3.2 MedCon integration
The use of MedCon library was necessary when we began to consider the medical standard
format implementation.After a visit of a nuclear imaging laboratory and several talks about the
standard formats,two medical formats are obvious:Nifti and Dicom.The format which is used
by Nicholas Dowson is the Nifti format while obviously the most used format in the french lab-
oratories is Dicom [
8
].As you can see on the figure
II.4
the doctors and the developers can use
3D viewers which are compatible with the common medical formats.MedCon is installed with a
graphical image viewer called xmedcon which great.
In order to understand the necessity of MedCon,let us see an extract of the Dicom image data
format:
14
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
Figure II.4:Medical 3D image viewer


Transfer
Syntax
UID
(0002,0010) 1
UI
[1.2.840.10008.1.2.1]
Image
Type
(0008,0008) 1-
n
CS
[
ORIGINAL
\
PRIMARY
]
Study
Date
(0008,0020) 1
DA
[1995.06.26]
Image
Date
(0008,0023) 1
DA
[1995.06.26]
Study
Time
(0008,0030) 1
TM
[11:20:00]
Modality
(0008,0060) 1
CS
[
MR
]
Manufacturer
(0008,0070) 1
LO
[
Philips
]
Patient
s
Name
(0010,0010) 1
PN
[
Doe
John
]
Slice
Thickness
(0018,0050) 1
DS
[10.00]
Series
Number
(0020,0011) 1
IS
[1]
Image
Number
(0020,0013) 1
IS
[103]
Samples
per
Pixel
(0028,0002) 1
US
[1]
Photometric
Interpretation
(0028,0004) 1
CS
[
MONOCHROME2
]
Number
of
Frames
(0028,0008) 1
IS
[16]
Frame
Increment
Pointer
(0028,0009) 1-
n
AT
[(0018,1063)]
Rows
(0028,0010) 1
US
[256]
Columns
(0028,0011) 1
US
[256]
Bits
Allocated
(0028,0100) 1
US
[8]
Bits
Stored
(0028,0101) 1
US
[8]
High
Bit
(0028,0102) 1
US
[7]
Pixel
Representation
(0028,0103) 1
US
[0]
Pixel
Data
(7
FE0
,0010) 1
OB
[1048576
bytes
at
offset
1022 (0
x3fe
)]




Where the first column is a set of tags as “Patient’s Name” or “Number of Frames” and at the
right the corresponding values according to the scheme on the figure
II.5
.
There is a huge number of available tags,this extract is not representative.What is important
15
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
Figure II.5:Dicom data elements.
to understand here is that medical formats are difficult to manipulate.So,we prefered to use the
MedCon to read/convert/write such elaborate formats.
II.3.3 Block Matching algorithm
A general blockmatching algorithm is introduced in the algorithm
1
[
11
,
12
].Let us not forget
all the symbols we can find here are defined in the list of symbols (p.
iv
).An other important
thing is that we present the 2D version of the algorithms in all the chapter for convenience,but,
there is no difficulty going through 3D.
Algorithm 1 General Block Matching algorithm
Require:
I
r
(reference image),I
m
(moving image)
for all (i,j) ∈ I
r
do
Compute (a,b) = Argmax
(r,s)∈I
m
∩V
i,j
C
r,s
i,j
Save the match (i,j) (a,b) and the corresponding score C
a,b
i,j
end for
Ensure:
List of {matches + scores}
The algorithm can be broken down into few parts:
1.
We choose a set of reference points in I
r
,
2.
For each reference point we define a set of moving points in I
m
but in the neighbourhood of
the reference point,
3.
Then we keep the moving point which has the greater similarity criterion value as the good
match for the reference point.
The similarity criterion is evaluated thanks to the mean and the standard deviation values of
the reference point and the moving point.For each class,we compute the mean and variance images
16
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
at the initialization step of the class.These datas are used for the similarity criterion computation
which is defined in the section
II.2
.
Sequential implementation
The detailed version of the sequential algorithm is shown in the algorithm
2
.
Algorithm 2 Sequential Block Matching algorithm (detailed)
Require:
I
r
,I
m
,Ω,Δ
1

2

i
min
= δ;i
max
= W −δ
j
min
= δ;j
max
= H −δ
for j = j
min
to j
max
step Δ
1
do
for i = i
min
to i
max
step Δ
1
do
bestScore = −∞
for s = j −Ω to j +Ω step Δ
2
do
for r = i −Ω to i +Ω step Δ
2
do
Compute C
r,s
i,j
if C
r,s
i,j
> bestScore then
bestScore = C
r,s
i,j
bestCouple = (r,s)
end if
end for
end for
Save the match (i,j) bestCouple and the corresponding score bestScore
end for
end for
Ensure:
List of {matches + scores}
Parallel implementation
The parallel implementation was not easy.As explained in the section
IV.1
,there were diffi-
culties to develop the current version of the BlockMatcherParallel class.We worked on the
kernels
which are named kMeanVariance.cl and kMatch.cl.The mean and variance images computation
is made by the first kernel kMeanVariance while the class is initializing.The parallel matching
algorithm is written into the kMatch kernel.The kernels implementation can be found in the
appendix
B
.
When we write a kernel,we need to be well aware of the role of the elementary task we are
working on.This means that an elementary task is a thread which is able to work on different
17
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
kind of data depending on the initialization of the kernel and the
working groups
.
Here,we initialize the system in such a way that each thread work at the pixel scale level.For
a given elementary task (for a given pixel),we can get the global index with that command:


int
ig
=
get_global_id
(0);




where ig is the global index of the thread.The local index of the thread (index of the elementary
task in the
working group
) should be got with that command:


int
iloc
=
get_local_id
(0);




In our kernels,we use the global memory all the time which is not perfect.The input im-
ages are global datas (declared __global),that is the way it has to be,but on a local level,we
do a lot of global memory accesses,which is not great (we should work on a storage space de-
clared as __local).We will see that in the section
IV.2
.The input and output arguments are
buffers that we need to allocate.The results are finally copied fromthe output buffers of the kernel.
The more tedious thing when you implement using OpenCL is to initialize the OpenCL con-
figuration.For example we need to define one by one each kernel argument (with clSetKernelArg
function) before calling the kernel execution,which is not required in CUDA language.Here is a
summary of the initialization procedure that you can find entirely in the method BlockMatcher-
Parallel::init implementation:
1.
Connect to a compute device (clGetPlatformIDs)
2.
Get a device of the appropriate type (CPU or GPU) (clGetDeviceIDs)
3.
Create a compute context (clCreateContext)
4.
Create a command queue (clCreateCommandQueue)
5.
Load the kernels from the files *.cl (clCreateProgramWithSource)
6.
Build the executable program (clBuildProgram)
7.
Create the compute kernel in the program (clCreateKernelsInProgram)
8.
Create the device memory vectors (buffer definition in the context) (clCreateBuffer)
9.
Transfer the input vector into device memory (clEnqueueWriteBuffer)
For the execution time,we need to respect this procedure:
1.
Set the arguments to the compute kernel (clSetKernelArg)
18
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
2.
Get the maximum work group size for executing the kernel on the device (clGetKernelWork-
GroupInfo)
3.
Launch the kernel execution (clEnqueueNDRangeKernel)
4.
Read the output buffer (clEnqueueReadBuffer)
All these steps are required.
19
Chapter III
Results
Contents
III.1 Experimental protocol
...........................
20
III.2 Tests
......................................
21
III.2.1 Time complexity
...............................
21
III.1 Experimental protocol
All the results are obtained using that configuration:
OS:Ubuntu 10.04 lucid
Environment:GNOME 2.30.2 (Ubuntu 2010-06-25)
Platform:x86_64
CPU:Intel(R) Core(TM) i7
RAM:3.9 Go
GPU:ATI Mobility Radeon HD 5800 Series,300MHz
Memory:GDDR5,400 MHz,1024 MB,64 Go/s.
Catalyst(TM) Drivers:
Version Catalyst(TM):10.12
Version 2D driver:8.80.5
Version Catalyst(TM) Control Center:2.13.
Version of our program:1.2.5 (rev.43)
We made a bash script which launch our program with different configurations using the com-
mand line options.For each set of parameters,the script save the results in severals files which
contain the values of the parameters and the measures.The measures are formated in a sepated
file which could be read and plot,for example with Gnuplot (that is what we did).Each time we
20
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
launched a new configuration,we waited the previous had finished.
At the top of the script,we define the list of values for each parameter.Then,the configurations
will be based on these values.
We will evaluate the complexity of the algorithm and sets of parameters to understand the
behaviour of the program.For all the measures we use as input images both images on figure
III.1
.
(a) reference image
(b) moving image
Figure III.1:Input images for the measures (256x256).
The matching map given by the program shows the list of matches where the green lines
represent the lower scores,the blue lines de median scores and the red lines the higher scores (the
best similarities).
III.2 Tests
III.2.1 Time complexity
The notation of the execution time is T.On the figure
III.2
we compare the execution time of
the sequential (at the left) and parralel (at the right) algorithms fonctions of a given parameter.
We can notice the efficiency of the
GPU
programming with the order of magnitude of the ex-
ecution times (all in seconds).The time complexity is polynomial for Δ
1
and Δ
2
parameters and
linear for N.
Now let us see some qualitative results:the figure
III.3
shows the matching map for the input
images defined previously.
21
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
0
5
10
15
20
25
30
35
40
45
50
1
2
3
4
5
6
7
8
9
10
T=f(Delta1)
"results/T-d1/graph.txt"
(a) T = f( Δ
1
)
0.88
0.882
0.884
0.886
0.888
0.89
1
2
3
4
5
6
7
8
9
10
T=f(Delta1)
"results/T-d1/graph.txt"
(b) T = f( Δ
1
)
0
5
10
15
20
25
30
35
40
45
50
1
2
3
4
5
6
T=f(Delta2)
"results/T-d2/graph.txt"
(c) T = f( Δ
2
)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
2
3
4
5
6
T=f(Delta2)
"results/T-d2/graph.txt"
(d) T = f( Δ
2
)
20
30
40
50
60
70
80
90
100
110
120
4
5
6
7
8
9
10
T=f(N)
"results/T-N/graph.txt"
(e) T = f( N )
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
2.2
4
5
6
7
8
9
10
T=f(N)
"results/T-N/graph.txt"
(f) T = f( N )
0
20
40
60
80
100
120
140
160
180
0
5
10
15
20
25
T=f(Omega)
"results/T-omega/graph.txt"
(g) T = f( Ω )
0
0.5
1
1.5
2
2.5
3
3.5
4
0
5
10
15
20
25
T=f(Omega)
"results/T-omega/graph.txt"
(h) T = f( Ω )
Figure III.2:Time execution functions of several parameters (sequential and parallel results).
Default values:T(sec),N = 6,Ω = 12,Δ
1
= 1,Δ
2
= 1
22
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
(a) Reference image
(b) Moving image
(c) Matching map
Figure III.3:Matching map/List of matches overview.
23
Chapter IV
Discussion
Contents
IV.1 Difficulties
...................................
24
IV.1.1 Learning the GPU programming
......................
24
IV.1.2 Work on ATI device
.............................
24
IV.1.3 Take into account the GPU device constraints
...............
25
IV.2 Considered optimizations and add-ons
..................
26
IV.2.1 Memory access in the kernels
........................
26
IV.2.2 Median variance implementation
......................
26
IV.1 Difficulties
IV.1.1 Learning the GPU programming
The GPUprogramming was something newfor us.We had to learn by ourselves howto get start
with the
OpenCL
library and also how to make an adapted program for the GPU programming.
We had to understand the procedure (Ch.
II.3.3
) detailed on the OpenCL documentation.It is
not an issue but that took a long time to know the causes of the different crashes of the program
because of our unawareness at the begining.
IV.1.2 Work on ATI device
At first,the project was supposed to be developed in
CUDA
but in order to make the local
tests easier,we choosed to work on our personal computers.However,our system is not running
with a NVidia graphic card but with a ATI card.So we proposed to use OpenCL which allows to
take advantage of the portable code (cross device).
24
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
IV.1.3 Take into account the GPU device constraints
We have established from experience to what extent it could be difficult to develop on GPU.
Crash system issue During the implementation,some crashes happened often enough with in
the best case a segmentation fault and in the worst case an system freeze (reboot needed).It has
been our biggest issue because the available means to locate the issue are purely practical and
these steps were laborious.So,the solutions have been to follow step by step the program process.
For example,we found that certain sizes of the images could cause the crash.
Debug methods In order to locate an issue,the first step was to find the nature of the issue
and then to do what is needed to find it.The implementation issues were usually quite easy to
find,all the more so the technical documentations on the
OpenCL
language are rich enough.The
algorithmic issues were not as easy to find and needed more time to be located.
With the aim of finding these issues,we displayed the value of the variables at specific times.
Is was very usefull for the interface validation between both parallel and sequential versions of the
blockmatching algorithm.
In order to print the significant values in the kernels,we used the following preprocessing
directive which is used in the dynamic compiling process of OpenCL:


#
pragma
OPENCL
EXTENSION
cl_amd_printf
:
enable




It enables us to use the printf function as we can find in the standard library stdio.h in the C
language,which is not possible in the CUDA language.However,we have to be attentive about
this method,because it takes long slowdowns,esandpecially when we display too many messages,
the last printed information is inexact and the stream is broken,so the program still run but
nothing more is displayed.
Memory constraints The memory constraints are not yet handled.Indeed if we need to be
careful of one thing about the parallel implementation it would be about the memory management.
We need to get ready to find out the memory issue that could come from memory access costs,
unavailable memory ressources,shared memory control or memory sizes of the different hardware
structure scales of the graphic card.
We do not believe that the current version of the program is robust enough to consider all
possible cases of input images,we do not have tested all cases.What is going wrong currently is
the 3D images handling.Indeed the depth of the medical images is often variant and the
OpenCL
language obviously needs storage buffer sizes multiple of the available size of a
working group
.
25
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
IV.2 Considered optimizations and add-ons
IV.2.1 Memory access in the kernels
In our kernel implementation,the loops contain several memory accesses.The problem is that
the memory used is the global memory of the device.The access time for this memory block is
higher than the access time of the local memory.Each working item read and write everytime the
global memory (__global definitions).
We could optimize the kernels using local storage bases (__local definitions).At the begining
of the kernel,we would copy the interesting local values of (i,j) from the reference image and the
local values of (r,s) from the moving image to local buffers.Then all the following accesses should
be on the local memory of the working goup it would be faster.
IV.2.2 Median variance implementation
During the development of the program we used a good way to dynamicly filter the worst
matches using the median variance computation.So we are able to apply this method on the
sequential algorithm but we did not implement that in the other class.We could see that it brings
an optimization on the computation time and also a more representative list of matches.We need
now to add this method in the kernels.
The principle is to compute the median variance value from the variance image computed in
the initialization of the class.Then,in the matching algorithm,for each (i,j) we compute the
variance of the block B
i,j
and we compare the value with the median variance.If the value is
greater than the median variance we continue otherwise we forget this block.
The final list of matches does not contain the matches of the homogeneous areas which are not
expected (Fig.
IV.1
).
(a) Matching map with the
median variance filter
(b) Matching map without
the median variance filter
Figure IV.1:Median Variance implementation.
26
Conclusion
The image registration algorithm,as we have seen,is made up of the
BMA
.We implemented
this algorithm in two C++ classes,the sequential and the parallel algorithms.With the OpenCL
library,the parallel class can be executed on CPU or GPU.
An interface has been made and validated in order to use both classes in the same way and to
get the same results.
The missing optimizations should increase programperformance.We can accelerate the process
thanks to the local memory of the working groups of the GPU and thanks to the median variance
computation.This last method gives also a more consistent list of matches.
The project has been rewarding.It allowed us to discover the GPU programming by studying
a fascinating subject which presents applications in medical imaging.
The fact that it takes part in the
CSIRO
projects motivated us for the development.
It would be interesting now to implement the second part of the image registration algorithm
and see the results of the
RTE
.It would give the rigid transformation that we expect.The
following steps would be to find the non-rigid transformations.
27
List of Figures
I.1 Anatomical imaging examples using MRI
.......................
3
I.2 Functional imaging examples
..............................
4
I.3 NVidia GPU
.......................................
4
I.4 Internal structure used by OpenCL
...........................
8
I.5 Sequential and parallel version of the same function
.................
9
II.1 Input images.
.......................................
12
II.2 Step of the block matching algorithm.
.........................
12
II.3 UML Diagram (using Umbrello)
............................
13
II.4 Medical 3D image viewer
................................
15
II.5 Dicom data elements.
..................................
16
III.1 Input images for the measures (256x256).
.......................
21
III.2 Time execution functions of several parameters (sequential and parallel results).
Default values:T(sec),N = 6,Ω = 12,Δ
1
= 1,Δ
2
= 1
...............
22
III.3 Matching map/List of matches overview.
.......................
23
IV.1 Median Variance implementation.
............................
26
28
List of Algorithms
1 General Block Matching algorithm
...........................
16
2 Sequential Block Matching algorithm (detailed)
....................
17
29
Bibliography
[1]
CImg official website.
http://cimg.sourceforge.net/
,October 2010.
[2]
CMake official website.
http://www.cmake.org/
,October 2010.
[3]
CUDA language official website.
http://www.nvidia.com/object/cuda_home_new.html
,
December 2010.
[4]
Doxygen official website.
http://www.doxygen.org/
,December 2010.
[5]
Subversion official website.
http://subversion.apache.org/
,October 2010.
[6]
Programming in OpenCL.
http://developer.amd.com/zones/OpenCLZone/programming/Pages/default.aspx
,
January 2011.
[7]
CSIRO official website.
http://www.csiro.au/
,January 2011.
[8]
Dicom format official website.
http://medical.nema.org/
,January 2011.
[9]
GPGPU Developer Resources.
http://gpgpu.org/developer
,January 2011.
[10]
Medcon official website.
http://xmedcon.sourceforge.net/
,January 2011.
[11]
Brian Avants,Murray Grossman,and James Gee.Symmetric Diffeomorphic Image Registra-
tion:Evaluating Automated Labeling of Elderly and Neurodegenerative Cortex and Frontal
Lobe.pages 50–57.2006.
[12]
Jérôme Yelnik,Eric Bardinet,Didier Dormont,Grégoire Malandain,Sébastien Ourselin,Do-
minique Tandé,Carine Karachi,Nicholas Ayache,Philippe Cornu,and Yves Agid.A three-
dimensional,histological and deformable atlas of the human basal ganglia.i.atlas construction
based on immunohistochemical and MRI data.Neuroimage,34(2):618–38,January 2007.
30
Appendix A
Getting start
Requirements
Some libraries are needed to compile the project:you will find lines about how to get/install
them.All the command lines was executed on Ubuntu 10.04 Lucid.

CMake and CCMake
:


$
sudo
apt
-
get
install
cmake
cmake
-
curses
-
gui





Subversion
:


$
sudo
apt
-
get
subversion





CImg
:Needs libX11,libmath and libpthread installed.

MedCon
:See
http://xmedcon.sourceforge.net/Main/Faq#AC
.

libboost_program_options
:


$
sudo
apt
-
get
install
libboost
-
program
-
options
-
dev





OpenCL
:
It is not easy to install this lib.First you must know which kind of GPU device you have.If
it is a NVidia device you must download the CUDA version of OpenCL,otherwise,it is an
ATI GPU device,in this case,you must download the ATI Stream version of OpenCL.
The second thing to check before to install is if your GPU drivers are installed.I do not know
how it could be on NVidia platforms but on ATI you must need to install the ATI Catalyst
31
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
drivers (download those drivers from the official website is pretty easy).
Then you can install OpenCL,a standard installation is much easier.Take care about the
version of your distrib (32 - 64).I choosed to install this library localy,here is the protocol
I followed:

configure and build OpenCL in a specific folder (ATISTREAMSDKROOT).

in your bash files,update the corresponding variables:


ATISTREAMSDKROOT
=
YOUR_PATH
/
ati
-
stream
-
sdk
-
v2
.3-
lnx64
ATISTREAMSDKSAMPLESROOT
=
YOUR_PATH
/
ati
-
stream
-
sdk
-
v2
.3-
lnx64
export
LD_LIBRARY_PATH
=
$ATISTREAMSDKROOT
/
lib
/
x86_64
export
LIBRARY_PATH
=
$
{
LIBRARY_PATH
}:
$ATISTREAMSDKROOT
/
lib
/
x86_64
export
CPLUS_INCLUDE_PATH
=
$
{
C_INCLUDE_PATH
}:
$ATISTREAMSDKROOT
/
include
export
C_INCLUDE_PATH
=
$
{
C_INCLUDE_PATH
}:
$ATISTREAMSDKROOT
/
include




This enables the linker to do its job.You can forget this step but you will need to specify
the paths into the projectconfig.cmake file.
Get the sources
It is very simple,you just need to download the lastest version of the sources using subversion
as usually:


$
svn
co
https
:
//
forge
.
clermont
-
universite
.
fr
/
svn
/
projet
-
gpu
-
reg
/
trunk
project
$
cd
project




Compile the project
Configure CMake


$
ccmake
.
EMPTY
CACHE
-->
press
"
c
"
then
"
g
"
$
make




32
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
Launch the compiler


$
make




If you use none standard installation for one or severals libraries,you may need to set the
variables in projectconfig.cmake file (include & lib folders).
Interface
Images Requirements
The images have to have the same format and the same size (tif,jpg,png,dcm,nii).As the
GPU module uses working groups we decided to use only muliple of 2 sizes but if needed the
program resize the images with a warning message.
Arguments
The algorithmparameters are defined using –N,–OMEGA,–DELTA1,–DELTA2 and –PADDING
options according to the parameters defined in this report (List of symbols p.
iv
).
Interactive mode
The interactive mode allows the user to set dynamically the parameters and to display the
results using a simple console menu.
Output files
The output files are:

A txt file which contains the list of matches,the execution time,the sum of the scores and
the parameters used to generate tese results.

Two images which represent the matches.
33
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
Summary


Allowed
options
:
-
h
[ --
help
]
produce
this
help
message
-
v
[ --
version
]
print
version
of
the
program
--------- --------------------------------------
--
N
arg
(=4)
set
the
size
of
a
block
of
pixels
--
OMEGA
arg
(=20)
set
the
range
for
the
search
area
--
DELTA1
arg
(=5)
set
the
amount
of
pixels
to
skip
between
2
points
on
the
reference
image
--
DELTA2
arg
(=1)
set
the
amount
of
pixels
to
skip
between
2
points
on
the
moving
image
--
PADDING
arg
(=25)
set
the
range
of
pixels
to
skip
from
the
sides
of
the
images
--------- --------------------------------------
-
i
[ --
input
]
arg
set
the
reference
and
moving
image
filenames
:
<
moving
(
data
/
sag1
.
tif
)> <
reference
(
data
/
sag
2.
tif
)>
--------- --------------------------------------
-
p
[ --
parallel
]
arg
(=1)
set
the
exec
mode
arg
= 1:
parallel
arg
= 0:
sequential
-
t
[ --
parallel
-
type
]
arg
(=1)
set
the
exec
mode
type
arg
= 1:
GPU
arg
= 0:
CPU
-
a
[ --
interactive
]
arg
(=0)
set
the
interactive
mode
arg
= 1:
active
arg
= 0:
disactive
-
o
[ --
output
]
arg
set
the
output
file
filename




34
Appendix B
Kernels implementation


//
__attribute__
((
vec_type_hint
(
float4
)
)
)
//
#
pragma
OPENCL
EXTENSION
cl_amd_printf
:
enable
__kernel
void
kMeanVariance
(
__global
float
*
inputL
,
__global
float
*
inputR
,
__global
float
*
outputLMean
,
__global
float
*
outputRMean
,
__global
float
*
outputLVariance
,
__global
float
*
outputRVariance
,
const
uint
w
,
const
uint
h
,
const
uint
d
,
const
int
N
)
{
int
ig
=
get_global_id
(0);
float
max1
= 0.0;
float
max2
= 0.0;
float
mean1
= 0.0;
float
mean2
= 0.0;
int
i
= 0;
int
j
= 0;
int
k
= 0;
int
I
;
int
J
;
int
K
;
int
wh
=
w
*
h
;
float
diff
;
float
normalisation
;
if
(
ig
<
wh
*
d
)
{
//
Initialization
of
the
matrices
(
edges
)
normalisation
= (
float
)((
d
>1)?(
N
+1)*(
N
+1)*(
N
+1):(
N
+1)*(
N
+1));
outputLMean
[
ig
] = 1.0/
normalisation
;
35
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
outputRMean
[
ig
] = 1.0/
normalisation
;
outputLVariance
[
ig
] = 1.0/
normalisation
;
outputRVariance
[
ig
] = 1.0/
normalisation
;
//
Set
the
indexes
i
=
ig
\%
wh
;
I
=
i
\%
w
;
J
=
i
/
w
;
K
=
ig
/
wh
;
//
Check
the
edges
if
(
I
<
N
/2 ||
I
>
w
-
N
/2-1)
return
;
if
(
J
<
N
/2 ||
J
>
h
-
N
/2-1)
return
;
if
(
d
>1 && (
K
<
N
/2 ||
K
>
d
-
N
/2-1))
return
;
if
(
d
< 1 )
return
;
mean1
= 0.0;
mean2
= 0.0;
max1
= 0.0;
max2
= 0.0;
//
Compute
the
Mean
matrix
if
(
d
>1 )
{
//
[...]
}
else
{
for
(
j
=
J
-
N
/2;
j
<=
J
+
N
/2;
j
++)
for
(
i
=
I
-
N
/2;
i
<=
I
+
N
/2;
i
++)
{
mean1
+=
inputL
[
j
*
w
+
i
];
mean2
+=
inputR
[
j
*
w
+
i
];
}
}
mean1
=
mean1
/
normalisation
;
mean2
=
mean2
/
normalisation
;
outputLMean
[
ig
] = (
float
)
mean1
;
outputRMean
[
ig
] = (
float
)
mean2
;
//
Compute
the
Variance
matrix
if
(
d
>1 )
{
//
[...]
36
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
}
else
{
for
(
j
=
J
-
N
/2;
j
<=
J
+
N
/2;
j
++)
for
(
i
=
I
-
N
/2;
i
<=
I
+
N
/2;
i
++)
{
diff
= (
inputL
[
j
*
w
+
i
] -
mean1
);
max1
+=
diff
*
diff
;
diff
= (
inputR
[
j
*
w
+
i
] -
mean2
);
max2
+=
diff
*
diff
;
}
}
max1
=
max1
/
normalisation
;
max2
=
max2
/
normalisation
;
outputLVariance
[
ig
] = (
float
)(
max1
);
outputRVariance
[
ig
] = (
float
)(
max2
);
}
//
Finish
the
threads
barrier
(
CLK_LOCAL_MEM_FENCE
);
}






//
__attribute__
((
vec_type_hint
(
float4
)
)
)
//
#
pragma
OPENCL
EXTENSION
cl_amd_printf
:
enable
__kernel
void
kMatch
(
__global
float
*
imgL
,
__global
float
*
imgR
,
__global
float
*
meanL
,
__global
float
*
meanR
,
__global
float
*
varL
,
__global
float
*
varR
,
__global
uint
*
kmatches
,
__global
float
*
scores
,
const
uint
w
,
const
uint
h
,
const
uint
d
,
const
int
N
,
const
int
OMEGA
,
const
int
padding
,
const
int
DELTA2
)
{
int
ig
=
get_global_id
(0);
int
wh
=
w
*
h
;
float
score
= 0.0;
float
moyg
;
37
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
float
moyd
;
int
i
;
int
j
;
int
k
;
int
r
;
int
s
;
int
t
;
int
r2
;
int
s2
;
int
t2
;
float
d1
;
float
d2
;
float
v1
;
float
v2
;
float
imR1Val
;
float
imR2Val
;
float
sc
;
float
bestscore
= 0.0;
float
normalisation
;
uint
bestMatch
= 0;
if
(
ig
<
wh
*
d
)
{
normalisation
= (
float
)((
d
>1)?(
N
+1)*(
N
+1)*(
N
+1):(
N
+1)*(
N
+1));
//
Set
the
indexes
r
=
ig
%
wh
;
i
=
r
%
w
;
j
=
r
/
w
;
k
=
ig
/
wh
;
bestMatch
=
ig
;
//
Check
the
edges
if
(
!((
i
<
padding
||
i
>
w
-
padding
-1)||
(
j
<
padding
||
j
>
h
-
padding
-1)||
(
d
>1 && (
k
<
padding
||
k
>
d
-
padding
-1))||
(
d
< 1 ))
)
{
//
Processing
if
(
d
== 1 )
38
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
{
//
get
the
means
moyg
=
meanL
[
ig
];
d1
= (
imgL
[
j
*
w
+
i
] -
moyg
);
//
get
the
left
variance
v1
=
sqrt
(
varL
[
j
*
w
+
i
]);
if
(
v1
> 0.0 )
{
for
(
s
=
j
-
OMEGA
;
s
<=
j
+
OMEGA
;
s
+=
DELTA2
)
for
(
r
=
i
-
OMEGA
;
r
<=
i
+
OMEGA
;
r
+=
DELTA2
)
{
score
= 0.0;
moyd
=
meanR
[
s
*
w
+
r
];
//
Compute
the
score
for
(
s2
=
s
-
N
/2;
s2
<=
s
+
N
/2;
s2
++)
for
(
r2
=
r
-
N
/2;
r2
<=
r
+
N
/2;
r2
++)
{
imR1Val
=
imgR
[
s2
*
w
+
r2
];
imR2Val
=
varR
[
s2
*
w
+
r2
];
d2
=(
imR1Val
-
moyd
);
v2
=
sqrt
(
imR2Val
);
sc
=
d1
*
d2
/(
v1
*
v2
);
score
+=
sc
;
}
score
=
score
/
normalisation
;
//
keep
the
match
?
if
(
score
>=
bestscore
)
{
bestscore
=
score
;
bestMatch
=(
uint
)(
s
*
w
+
r
);
}
}
}
39
Jérémy Coatelen & Yu Qin - Image registration on GPU ISIMA
}
else
{
//
[...]
}
}
//
Save
the
results
kmatches
[
ig
] = (
uint
)
bestMatch
;
scores
[
ig
] = (
float
)
bestscore
;
}
barrier
(
CLK_LOCAL_MEM_FENCE
);
}




40