GPU Acceleration in Registration

Danny Ruijters

26 April 2007

GPU Acceleration in Registration, Danny Ruijters

Outline


The GPU


Rigid 3D-3D Registration


Elastic Registration


Conclusions

The GPU

The graphics card



Rasterization of primitives



Texture mapping



Colour interpolation

The GPU


Graphics Processing Unit


Programmable processor in the graphics rendering pipeline


Parallel execution (SIMD-like)


Graphics rendering pipeline

[Figure: block diagram of the graphics rendering pipeline. Stages: CPU, vertex shading (T&L), triangle setup, rasterization, fragment shading and raster operations. Memory: system memory, video memory, on-chip cache memory (pre-TnL cache, post-TnL cache, texture cache). Data: geometry, commands, textures, frame buffer.]

Bottlenecks

[Figure: the graphics rendering pipeline (same stages, memories, and caches as on the previous slide), annotated with the possible bottlenecks: CPU limited, transfer limited, transform limited, setup limited, raster limited, fragment shader limited, texture limited, frame buffer limited.]

[Figure: GPU architecture diagram. Labels: 128 processing units, local cache, shared memory.]

Performance


Parallelism & pipelining (up to 16 parallel pipelines)


Vector processor


Moore’s Law: CPU performance doubles every 18 months


GPU performance doubles every 6 months
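
To make the two doubling rates concrete, the following minimal sketch compounds them over the same period. The three-year horizon and the function name are illustrative assumptions, not figures from the talk.

```python
def growth_factor(months, doubling_period_months):
    """Relative performance after `months` if it doubles every `doubling_period_months`."""
    return 2.0 ** (months / doubling_period_months)

# Hypothetical horizon of 36 months, chosen only for illustration.
cpu_3y = growth_factor(36, 18)   # doubling every 18 months
gpu_3y = growth_factor(36, 6)    # doubling every 6 months
print(cpu_3y, gpu_3y, gpu_3y / cpu_3y)   # 4.0 64.0 16.0
```

Under the quoted rates, the gap between the two grows by a factor of 16 in three years.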


                         GeForce 7900 GTX        GeForce 8800 GTX
Code name                G71                     G80
Release date             3 / 2006                11 / 2006
Transistors              278 M (90 nm)           681 M (90 nm)
Clock speed              650 MHz                 1350 MHz
Processing units         24+8 (pixel + vertex)   128 (unified)
Peak pixel fill rate     10.4 Gigapixels/s       36.8 Gigapixels/s
Peak memory bandwidth    51.2 GB/s (256 bit)     86.4 GB/s (384 bit)
Memory                   512 MB                  768 MB
Peak performance         250 Gigaflops           520 Gigaflops

Textures & buffers


1D, 2D, 3D textures



2D output buffers (frame buffer, accumulation buffer, stencil buffer, p-buffer)

8, 10, 12, 16 bit integers, 16, 32 bit floating point

1 (intensity), 2 (luminance-alpha), 3 (RGB), 4 (RGBA) components per pixel
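
The choice of bit depth and component count decides whether a volume fits into the video memory quoted earlier (512 MB and 768 MB). The sketch below estimates the uncompressed size of a texture. The 256 x 256 x 256 volume is a hypothetical example, and real drivers may add padding that this calculation ignores.

```python
def texture_bytes(dims, bits_per_component, components):
    """Uncompressed size in bytes of a texture, ignoring mipmaps and padding."""
    texels = 1
    for d in dims:
        texels *= d
    return texels * components * bits_per_component // 8

# Hypothetical volume size, chosen only for illustration.
vol_16bit = texture_bytes((256, 256, 256), 16, 1)     # 16 bit integer intensity
vol_rgba32f = texture_bytes((256, 256, 256), 32, 4)   # 32 bit floating point RGBA
print(vol_16bit // 2**20, "MiB")     # 32 MiB
print(vol_rgba32f // 2**20, "MiB")   # 256 MiB
```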

Historic overview GPU


RenderMan (1988, pre-history)

Intel MMX (SIMD, 1997, pre-history)

Register combiners (nVidia, 1999, bronze age)

Vendor-specific APIs (2001, iron age)

Generic assembly-like language (2002, middle ages)

Different high-level languages (2003, industrial age)

CUDA: general purpose C-like language (2007, modern age)

Register combiners (1999, bronze age)

// Stage 0
// spare0.rgb = gradient dot ViewDir, spare1.rgb = -(gradient dot ViewDir)
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_A_NV, GL_TEXTURE0_ARB, GL_EXPAND_NORMAL_NV, GL_RGB);
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_B_NV, GL_CONSTANT_COLOR1_NV, GL_EXPAND_NORMAL_NV, GL_RGB);
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_C_NV, GL_TEXTURE0_ARB, GL_EXPAND_NEGATE_NV, GL_RGB);
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_D_NV, GL_CONSTANT_COLOR1_NV, GL_EXPAND_NORMAL_NV, GL_RGB);
glCombinerOutputNV(GL_COMBINER0_NV, GL_RGB, GL_SPARE0_NV, GL_SPARE1_NV, GL_DISCARD_NV, GL_NONE, GL_NONE, GL_TRUE, GL_TRUE, GL_FALSE);


GL_ARB_fragment_program (2002)

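!!ARBfp1.0

# Inputs: texture coordinate and interpolated colour of this fragment
ATTRIB coord = fragment.texcoord[0];
ATTRIB color = fragment.color;
OUTPUT out = result.color;
TEMP texel;
TEMP lookup;

# Sample the 3D texture, then use the result as coordinate into a 1D lookup table
TEX texel, coord, texture[0], 3D;
TEX lookup, texel, texture[1], 1D;

# Modulate the looked-up value with the fragment colour
MUL out, lookup, color;
END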

GLSlang (2003)

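uniform vec3 ViewDir;

void main(void)
{
    float value;
    vec3 gradient;

    // Gradient is stored in [0, 1]; expand it to [-1, 1]
    gradient = texture3(0, gl_TexCoord0) * 2.0 - 1.0;

    // Largest where the gradient is perpendicular to the viewing direction
    value = 1.0 - abs(dot(gradient, ViewDir));

    // Scale by the squared gradient magnitude
    value *= 1.3 * dot(gradient, gradient);
    value = clamp(value, 0.0, 1.0);

    gl_FragColor = vec4(value);
}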

CUDA (2007)


Compute Unified Device Architecture


General purpose C-like language


nVidia only


Very recently released

Rigid 3D-3D Registration

3DRA-MR registration

3DRA-XperCT Registration 1

Pre-operative

3DRA-XperCT Registration 2

Post-operative: verification of the embolization

3DRA Slice

Mutual information

F. Maes et al., "Multimodality Image Registration by Maximization of Mutual Information," IEEE Transactions on Medical Imaging 16(2), pp. 187-198, April 1997

Joint histogram

Resampling

Joint histogram: increment(g, g)
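
The increment step and the mutual information built on it can be sketched on the CPU as follows. This is a minimal illustration and not the implementation from the talk: the function names, the tiny test images, and the use of nearest-bin counting (no partial-volume weighting) are all assumptions. Mutual information is taken in its standard form, I(A;B) = sum over (a, b) of p(a,b) log(p(a,b) / (p(a) p(b))).

```python
import math

def joint_histogram(ref, flt, bins):
    """Count co-occurring intensity pairs; both images are flat lists of ints in [0, bins)."""
    hist = [[0] * bins for _ in range(bins)]
    for a, b in zip(ref, flt):
        hist[a][b] += 1          # the "increment" step, once per resampled voxel
    return hist

def mutual_information(hist):
    """I(A;B) = sum p(a,b) * log(p(a,b) / (p(a) * p(b))), in nats."""
    total = sum(sum(row) for row in hist)
    p_a = [sum(row) / total for row in hist]
    p_b = [sum(hist[a][b] for a in range(len(hist))) / total
           for b in range(len(hist[0]))]
    mi = 0.0
    for a, row in enumerate(hist):
        for b, count in enumerate(row):
            if count:
                p_ab = count / total
                mi += p_ab * math.log(p_ab / (p_a[a] * p_b[b]))
    return mi

# Hypothetical 8-voxel images with 4 intensity levels.
ref = [0, 0, 1, 1, 2, 2, 3, 3]
dependent = [3, 3, 2, 2, 1, 1, 0, 0]      # inverted contrast, fully determined by ref
independent = [0, 1, 0, 1, 0, 1, 0, 1]    # statistically independent of ref
mi_dependent = mutual_information(joint_histogram(ref, dependent, 4))
mi_independent = mutual_information(joint_histogram(ref, independent, 4))
print(mi_dependent, mi_independent)
```

The inverted-contrast pair scores high even though the intensities differ everywhere, which is why the measure suits multimodality registration such as 3DRA and MR.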

3DRA-MR, before, after

3DRA-MR: CPU interpolation

3DRA-MR: GPU interpolation

Elastic Registration

Elastic deformation


Parameterized deformation:




B-spline deformation:

Cubic B-spline
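
As an illustration, the uniform cubic B-spline basis function and a one-dimensional B-spline deformation can be sketched as below. The form g(x) = x + sum_j c_j * beta3(x / h - j) follows the common parametric model (as in the Kybic and Unser paper cited later in the talk). The function names, the control point values, and the spacing are hypothetical.

```python
def beta3(t):
    """Uniform cubic B-spline basis function with support (-2, 2)."""
    t = abs(t)
    if t < 1.0:
        return 2.0 / 3.0 - t * t + 0.5 * t ** 3
    if t < 2.0:
        return (2.0 - t) ** 3 / 6.0
    return 0.0

def deform(x, coeffs, spacing):
    """1D deformation g(x) = x + sum_j c_j * beta3(x / spacing - j)."""
    return x + sum(c * beta3(x / spacing - j) for j, c in enumerate(coeffs))

# The four basis functions overlapping any position sum to one.
weights = [beta3(0.3 - j) for j in range(-1, 3)]
print(sum(weights))

# Hypothetical control points every 10 pixels; only the second one is displaced.
print(deform(10.0, [0.0, 2.0, 0.0, 0.0], spacing=10.0))
```

Because each basis function has finite support, moving one control point changes the deformation only in its neighbourhood.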

GPU linear interpolation


Hardwired: linear interpolation is much faster than separate lookups

GPU Cubic Interpolation


Compose cubic interpolation from weighted sum of linear interpolations

C. Sigg, M. Hadwiger, "Fast Third-Order Texture Filtering", GPU Gems 2

Outline of proof
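
One way to outline the argument, in the spirit of the Sigg and Hadwiger reference, is to pair the four cubic B-spline weights. The notation is an assumption: $f_{i-1}, \dots, f_{i+2}$ are the four neighbouring samples, $w_0, \dots, w_3$ their B-spline weights, and $\mathrm{lerp}(p)$ denotes a linearly interpolated lookup at position $p$.

```latex
\begin{align*}
w_0 f_{i-1} + w_1 f_i
  &= (w_0 + w_1)\left[\left(1 - \tfrac{w_1}{w_0 + w_1}\right) f_{i-1}
     + \tfrac{w_1}{w_0 + w_1}\, f_i\right]
   = (w_0 + w_1)\,\mathrm{lerp}\!\left(i - 1 + \tfrac{w_1}{w_0 + w_1}\right) \\
w_2 f_{i+1} + w_3 f_{i+2}
  &= (w_2 + w_3)\,\mathrm{lerp}\!\left(i + 1 + \tfrac{w_3}{w_2 + w_3}\right)
\end{align*}
```

The step is valid because the B-spline weights are non-negative, so both fractions lie in $[0, 1]$ and each lookup position falls between two adjacent samples.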

GPU Cubic Interpolation


2D: 4 linear-interpolated lookups, instead of 16 direct lookups

3D: 8 linear-interpolated lookups, instead of 64 direct lookups
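
The one-dimensional case (2 linear lookups instead of 4 direct ones) can be checked numerically with the sketch below. It is a CPU stand-in, not shader code: `linear_lookup` imitates the hardware's linearly filtered fetch, and all function names and sample values are illustrative assumptions.

```python
import math

def bspline_weights(a):
    """Cubic B-spline weights for samples i-1, i, i+1, i+2 at fraction a in [0, 1)."""
    w0 = (1.0 - a) ** 3 / 6.0
    w1 = (3.0 * a ** 3 - 6.0 * a ** 2 + 4.0) / 6.0
    w2 = (-3.0 * a ** 3 + 3.0 * a ** 2 + 3.0 * a + 1.0) / 6.0
    w3 = a ** 3 / 6.0
    return w0, w1, w2, w3

def linear_lookup(f, x):
    """Stand-in for a linearly filtered texture fetch at position x."""
    i = min(int(math.floor(x)), len(f) - 2)
    t = x - i
    return (1.0 - t) * f[i] + t * f[i + 1]

def cubic_direct(f, x):
    """Reference: weighted sum of four direct lookups."""
    i = int(math.floor(x))
    w = bspline_weights(x - i)
    return sum(wk * f[i - 1 + k] for k, wk in enumerate(w))

def cubic_from_linear(f, x):
    """Same value from two linearly interpolated lookups."""
    i = int(math.floor(x))
    w0, w1, w2, w3 = bspline_weights(x - i)
    g0, g1 = w0 + w1, w2 + w3
    return (g0 * linear_lookup(f, i - 1 + w1 / g0)
            + g1 * linear_lookup(f, i + 1 + w3 / g1))

# Hypothetical sample values.
samples = [0.0, 1.0, 4.0, 9.0, 7.0, 3.0, 2.0, 5.0]
print(cubic_direct(samples, 3.4), cubic_from_linear(samples, 3.4))
```

Applying the same pairing along each axis gives the 4 lookups in 2D and 8 lookups in 3D stated above.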

GPU Linear Interpolation Accuracy

[Figure: measured linear interpolation error on an nVidia QuadroFX 3500. Axis label: "Error * -10^-8".]

Linear deformation, linear interpolation

Linear deformation, cubic interpolation

Cubic deformation, linear interpolation

Cubic deformation, cubic interpolation

Optimization


Many parameters: huge parameter space


Solution: use derivative information, such as the Jacobian and Hessian


Examples: Gradient Descent, Quasi-Newton, Levenberg-Marquardt
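
Of the listed optimizers, gradient descent is the simplest to sketch. The example below uses a toy quadratic cost with a known minimum instead of a real similarity measure. The cost, step size, and iteration count are assumptions for illustration only.

```python
def gradient_descent(grad, start, step, iterations):
    """Plain fixed-step gradient descent; `grad` returns the gradient as a list."""
    p = list(start)
    for _ in range(iterations):
        g = grad(p)
        p = [pi - step * gi for pi, gi in zip(p, g)]
    return p

# Toy cost sum_i (p_i - t_i)^2 with its minimum at `target` (hypothetical values).
target = [1.5, -2.0, 0.25]

def toy_gradient(p):
    """Gradient of the toy cost: 2 * (p_i - t_i) per parameter."""
    return [2.0 * (pi - ti) for pi, ti in zip(p, target)]

result = gradient_descent(toy_gradient, [0.0, 0.0, 0.0], step=0.25, iterations=50)
print(result)
```

Quasi-Newton and Levenberg-Marquardt methods use curvature information in addition to the gradient and therefore typically need fewer iterations.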

GPU Elastic Registration Iteration

1. Generate deformed image on GPU & store to texture

2. Calculate Similarity Measure & First-Order Derivative on GPU

   Texture with reference image

   Texture with deformed image

First-Order Derivative of Sim. Measure

J. Kybic, M. Unser, “Fast Parametric Elastic Image Registration”
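
The derivative named in the title can be written with the chain rule. The notation below is an assumption chosen to mirror the parametric model of the cited paper: $E$ is the similarity measure, $f_w$ the deformed image, $g$ the deformation, and $c_j$ the B-spline control point coefficients.

```latex
\[
\frac{\partial E}{\partial c_j}
  = \sum_{x}
    \frac{\partial E}{\partial f_w(x)}\;
    \nabla f_w(x)\;
    \frac{\partial g(x)}{\partial c_j}
\]
```

The three factors correspond to the three slides that follow: the derivative of the similarity measure, the derivative of the deformed image, and the derivative with respect to the control points.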

Derivative of the Similarity Measure

SSD:
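
For the sum of squared differences (SSD), the per-pixel derivative with respect to the deformed image is simple to state. The sketch below assumes the mean-normalised form E = (1/N) * sum_i (deformed_i - ref_i)^2; the normalisation, function names, and pixel values are assumptions rather than details taken from the slide.

```python
def ssd(ref, deformed):
    """Mean of squared differences between two equally sized images (flat lists)."""
    return sum((d - r) ** 2 for r, d in zip(ref, deformed)) / len(ref)

def ssd_derivative(ref, deformed):
    """dE/d(deformed[i]) = 2 * (deformed[i] - ref[i]) / N, one value per pixel."""
    n = len(ref)
    return [2.0 * (d - r) / n for r, d in zip(ref, deformed)]

# Hypothetical 4-pixel images.
ref = [0.0, 1.0, 2.0, 3.0]
deformed = [0.5, 1.0, 1.0, 3.0]
print(ssd(ref, deformed))              # 0.3125
print(ssd_derivative(ref, deformed))   # [0.25, 0.0, -0.5, 0.0]
```

Each pixel's derivative depends only on that pixel, so the computation maps directly onto one fragment per pixel.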

Derivative of the Deformed Image


Sobel operator to calculate gradients:

-1   0   1
-4   0   4
-1   0   1

 1   4   1
 0   0   0
-1  -4  -1
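
The two kernels above can be tried on a small ramp image as follows. The normalisation by 12 (smoothing weights 1 + 4 + 1, times the central-difference span of 2), the function name, and the test image are assumptions made for this sketch.

```python
KX = [[-1, 0, 1],
      [-4, 0, 4],
      [-1, 0, 1]]
KY = [[ 1,  4,  1],
      [ 0,  0,  0],
      [-1, -4, -1]]

def filter3x3(img, kernel, row, col):
    """Correlate a 3x3 kernel with the neighbourhood of an interior pixel."""
    return sum(kernel[dr + 1][dc + 1] * img[row + dr][col + dc]
               for dr in (-1, 0, 1) for dc in (-1, 0, 1))

# Hypothetical ramp image: intensity rises by 3 per column and by 1 per row.
img = [[3 * c + r for c in range(5)] for r in range(5)]
gx = filter3x3(img, KX, 2, 2) / 12.0   # 12 = (1 + 4 + 1) * 2
gy = filter3x3(img, KY, 2, 2) / 12.0
print(gx, gy)   # 3.0 -1.0
```

The second kernel subtracts the lower row from the upper row, so with row indices increasing downward its sign is opposite to the row-wise slope of the ramp.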

Derivative of the Control Points


Constant


B-spline: separable kernel of fixed size
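
Separability means that the 2D kernel is the outer product of a 1D kernel with itself, and since it does not depend on the control point values it can be computed once. The sketch below builds such a kernel from the cubic B-spline basis function. The control point spacing of 4 pixels and the function names are hypothetical.

```python
def beta3(t):
    """Uniform cubic B-spline basis function with support (-2, 2)."""
    t = abs(t)
    if t < 1.0:
        return 2.0 / 3.0 - t * t + 0.5 * t ** 3
    if t < 2.0:
        return (2.0 - t) ** 3 / 6.0
    return 0.0

def kernel_1d(spacing):
    """beta3 sampled at the pixel offsets where it is non-zero."""
    return [beta3(k / spacing) for k in range(-2 * spacing + 1, 2 * spacing)]

def kernel_2d(spacing):
    """Separable 2D kernel: outer product of the 1D kernel with itself."""
    k = kernel_1d(spacing)
    return [[a * b for b in k] for a in k]

k1 = kernel_1d(4)   # hypothetical control point spacing of 4 pixels
k2 = kernel_2d(4)
print(len(k1), len(k2), len(k2[0]))
```

Filtering with the 1D kernel along each axis in turn gives the same result as applying the full 2D kernel, at a cost that grows with the kernel width instead of its area.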

Original Fluoroscopy Sequence

2 * 2 Control Points

8 * 8 Control Points

Deformation Field

GPU Elastic Registration


40 images: Quasi Newton: 16 seconds


Gradient Descent: 63 seconds


8 * 8 Control Points: rest motion


Multi-resolution deformation field, with reduced parameters (discussed with Dirk Loeckx)

CUDA Libraries

CUDA Software Stack

CUDA Libraries


CUBLAS


CUFFT

CUBLAS


Basic Linear Algebra Subprograms


Vector, Matrix, Numerical Math


Almost no initialization


Function calls
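
To show what such function calls compute, here is a plain Python reference for two BLAS level 1 operations, axpy and dot, which cuBLAS offers in single precision as cublasSaxpy and cublasSdot. This is a CPU stand-in for the mathematics only and does not use the cuBLAS API. That these two operations match the benchmarks shown next is an assumption.

```python
def axpy(alpha, x, y):
    """BLAS level 1 axpy: returns alpha * x + y, element by element."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    """BLAS level 1 dot: inner product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

# Hypothetical input vectors.
print(axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))   # [12.0, 24.0, 36.0]
print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))            # 32.0
```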

CUBLAS performance

[Figure: execution times of scalar vector add on a dual-core Woodcrest and a G80 core. Axes: data size vector (kB) versus execution time (ms). Series: G80, Woodcrest.]

CUBLAS performance

[Figure: execution times of the vector inner product on a dual-core Woodcrest and a G80 core. Axes: data size vector (kB) versus execution time (ms). Series: G80, Woodcrest.]

CUFFT performance

[Figure: execution times of a 2D FFT (size 2^n) on a single-core Woodcrest and a G80 core. Axes: N point 2D FFT versus execution time (ms). Series: G80 (CudaFFT), Woodcrest (FFTW).]

Conclusion & Future work

Conclusions


GPU: powerful parallel processor, but has its limitations

Rigid Registration: interpolation on the GPU

Elastic Registration: calculation of the Similarity Measure & first order derivative on the GPU

Future work


Multi-resolution deformation fields

2D-3D registration of the Coronary Arteries (not presented)

Questions?