General Purpose Computing on Graphics Processing Units: Optimization Strategy



Henry Au

Space and Naval Warfare Center Pacific

henry.au@navy.mil

09/12/12

Distribution Statement

Outline

- Background
- NVIDIA’s CUDA
- Decomposition & Porting
- CUDA Optimizations
- GPU Results
- Conclusion


Background

- Parallel programming on GPUs
- General-Purpose Computation on Graphics Processing Units (GPGPU)
- Compute Unified Device Architecture (CUDA)
- Open Computing Language (OpenCL™)



Background

GPUs vs. CPUs

- GPU and CPU cores are not the same
- A CPU core is faster and more robust, but a CPU has far fewer cores
- A GPU core is neither as robust nor as fast, but a GPU handles repetitive tasks quickly

NVIDIA GeForce GTX 470

- 448 cores
- Memory bandwidth: 133.9 GB/sec
- 544.32 GFLOPS double precision

Intel Core i7-965

- 4 cores
- Memory bandwidth: 25.6 GB/sec
- 69.23 GFLOPS double precision


CUDA by NVIDIA

- Compute Unified Device Architecture
- Low- and high-level APIs available
- C for CUDA
- High-latency memory transfers
- Limited cache
- Scalable programming model
- Requires NVIDIA graphics cards
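A minimal sketch of the C for CUDA model (identifiers are illustrative, not from the presentation): device code is marked __global__, launched with the <<<blocks, threads>>> syntax, and fed by explicit copies across the bus, the high-latency transfers noted above:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    // Each thread scales one element; __global__ marks device code.
    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float* h_data = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

        float* d_data;
        cudaMalloc((void**)&d_data, bytes);

        // The explicit round-trip copies are where the latency lives.
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }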



Decomposition and Porting

Amdahl’s and Gustafson’s Laws

- Estimate the achievable speed-up
- P is the amount of parallel scaling achieved
- γ is the fraction of the algorithm that is serial
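In their standard forms, with γ the serial fraction and P the amount of parallel scaling, the two estimates are:

    S_{\text{Amdahl}} = \frac{1}{\gamma + \frac{1 - \gamma}{P}}
    \qquad
    S_{\text{Gustafson}} = P - \gamma\,(P - 1)

Amdahl’s law bounds the speed-up of a fixed-size problem by its serial fraction; Gustafson’s law assumes the workload grows with P, giving the more optimistic estimate for large inputs.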



Decomposition and Porting

TAU Profile

- Determine call paths and consider subroutine calls
- Pay attention to large for loops and redundant computations
- Works with Visual Studio 2008
- Initialize the profile: TAU_PROFILE("StartFor", "Main", TAU_USER);
- Place timers:
  - TAU_START("FunctionName")
  - TAU_STOP("FunctionName")
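A minimal sketch of this instrumentation, assuming TAU’s C/C++ macro API; the hotspot function filterFrame is hypothetical:

    #include <TAU.h>

    // Hypothetical hotspot: the kind of large per-element loop the profile
    // should flag as a candidate for porting to the GPU.
    void filterFrame(float* frame, int n) {
        TAU_START("filterFrame");        // start a timer around the loop
        for (int i = 0; i < n; ++i)
            frame[i] *= 0.5f;
        TAU_STOP("filterFrame");         // TAU accumulates the elapsed time
    }

    int main(int argc, char** argv) {
        TAU_PROFILE("StartFor", "Main", TAU_USER);  // initialize, as on the slide
        TAU_PROFILE_SET_NODE(0);         // single-process (non-MPI) run

        float frame[64] = {0.0f};
        filterFrame(frame, 64);
        return 0;
    }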



Decomposition and Porting

CUDA Overhead

- High latency is associated with memory transfers
  - It can be hidden with large amounts of mathematical computation
- Reduce device-to-host memory transfers
  - Prefer fewer, larger transfers over many small ones (see the sketch below)
- Consider performing even serial tasks on the parallel processors when that avoids a transfer
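A hedged sketch of the "fewer but larger transfers" point; the staging-buffer names and sizes are illustrative:

    #include <cuda_runtime.h>
    #include <string.h>

    #define CHUNKS        1024   // number of small transfers being replaced
    #define CHUNK_FLOATS  256    // floats per chunk (illustrative)

    // Replace CHUNKS small host-to-device copies with a single large one,
    // so the per-call latency is paid once instead of CHUNKS times.
    void copyBatched(float* chunks[CHUNKS], float* d_buffer) {
        const size_t bytes = (size_t)CHUNKS * CHUNK_FLOATS * sizeof(float);

        float* h_staging;
        cudaMallocHost((void**)&h_staging, bytes);  // page-locked staging buffer

        // Pack the scattered chunks contiguously on the host first...
        for (int i = 0; i < CHUNKS; ++i)
            memcpy(h_staging + i * CHUNK_FLOATS, chunks[i],
                   CHUNK_FLOATS * sizeof(float));

        // ...then make one large transfer across the bus.
        cudaMemcpy(d_buffer, h_staging, bytes, cudaMemcpyHostToDevice);
        cudaFreeHost(h_staging);
    }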



CUDA Optimizations

- Thread and block occupancy
  - Varies depending on the graphics card
- Page-locked memory
  - cudaHostAlloc()
  - A limited resource that should not be overused
- Streams
  - A queue of GPU operations, such as computation “kernels” and memory copies
- Asynchronous memory calls
  - Ensure non-blocking calls
  - cudaMemcpyAsync() or a kernel launch


Thread Occupancy

- Ensure enough threads are operating at the same time
- 256 threads per block worked well here; the hardware maximum is 1024 threads per block
- Monitor occupancy (chart and launch sketch below)


[Chart: ALF frames per second processed vs. threads per block, for block sizes from 4 to 1024; throughput ranges from roughly 40 to 90 FPS.]
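A launch-configuration sketch consistent with the figures above; the 256-thread block size is the deck’s, the element count n is illustrative, and the device-property query is standard runtime API:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        // Query the card rather than hard-coding its limits; the best block
        // size and the 1024-thread cap vary with the device.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("max threads per block: %d\n", prop.maxThreadsPerBlock);

        const int n = 1 << 20;            // illustrative element count
        const int threadsPerBlock = 256;  // value the measurements favored
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil
        printf("launch: %d blocks x %d threads\n", blocks, threadsPerBlock);
        return 0;
    }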
CUDA Optimizations

- Page-Locked Host Memory
  - cudaHostAlloc() vs. malloc vs. new (chart and sketch below)



[Chart: processing time (ms) vs. megabytes of data processed (0 to 14 MB) for buffers allocated with new, malloc, and cudaHostAlloc.]
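A sketch of the three allocation paths feeding the same device copy (buffer size illustrative); only the cudaHostAlloc buffer is page-locked, so only it transfers by direct DMA and makes cudaMemcpyAsync truly asynchronous:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        const size_t bytes = 12 * 1024 * 1024;  // ~12 MB, the chart's range

        float* viaNew    = new float[bytes / sizeof(float)];  // pageable
        float* viaMalloc = (float*)malloc(bytes);             // pageable
        float* pinned;
        cudaHostAlloc((void**)&pinned, bytes, cudaHostAllocDefault);  // page-locked

        float* d_buf;
        cudaMalloc((void**)&d_buf, bytes);

        // Pageable sources are staged through an internal pinned buffer by
        // the driver; the cudaHostAlloc buffer is DMA-transferred directly.
        cudaMemcpy(d_buf, viaMalloc, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_buf, pinned, bytes, cudaMemcpyHostToDevice);

        cudaFree(d_buf);
        cudaFreeHost(pinned);   // cudaHostAlloc pairs with cudaFreeHost
        free(viaMalloc);
        delete[] viaNew;
        return 0;
    }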
CUDA Optimizations

- Stream Structure, Non-Optimized
  - Processing time: 49.5 ms



    cudaMemcpyAsync(dataA0, stream0, HostToDevice)
    cudaMemcpyAsync(dataB0, stream0, HostToDevice)
    kernel<<<blocks, threads, stream0>>>(result0, dataA0, dataB0)
    cudaMemcpyAsync(result0, stream0, DeviceToHost)
    cudaMemcpyAsync(dataA1, stream1, HostToDevice)
    cudaMemcpyAsync(dataB1, stream1, HostToDevice)
    kernel<<<blocks, threads, stream1>>>(result1, dataA1, dataB1)
    cudaMemcpyAsync(result1, stream1, DeviceToHost)

CUDA Optimizations

- Stream Structure, Optimized
  - Processing time: 49.4 ms


    cudaMemcpyAsync(dataA0, stream0, HostToDevice)
    cudaMemcpyAsync(dataA1, stream1, HostToDevice)
    cudaMemcpyAsync(dataB0, stream0, HostToDevice)
    cudaMemcpyAsync(dataB1, stream1, HostToDevice)
    kernel<<<blocks, threads, stream0>>>(result0, dataA0, dataB0)
    kernel<<<blocks, threads, stream1>>>(result1, dataA1, dataB1)
    cudaMemcpyAsync(result0, stream0, DeviceToHost)
    cudaMemcpyAsync(result1, stream1, DeviceToHost)

CUDA Optimizations

- Stream Structure, Optimized & Modified
  - Processing time: 41.1 ms


    cudaMemcpyAsync(dataA0, stream0, HostToDevice)
    cudaMemcpyAsync(dataA1, stream1, HostToDevice)
    cudaMemcpyAsync(dataB0, stream0, HostToDevice)
    cudaMemcpyAsync(dataB1, stream1, HostToDevice)
    kernel<<<blocks, threads, stream0>>>(result0, dataA0, dataB0)
    cudaMemcpyAsync(result0, stream0, DeviceToHost)
    kernel<<<blocks, threads, stream1>>>(result1, dataA1, dataB1)
    cudaMemcpyAsync(result1, stream1, DeviceToHost)
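For reference, a runnable sketch of the modified ordering above with the full runtime signatures, which the slide pseudocode elides; buffer names are illustrative, and the h_* host buffers must be page-locked (cudaHostAlloc) for the asynchronous copies to actually overlap:

    #include <cuda_runtime.h>

    #define N (1 << 20)   // elements per stream (illustrative)

    __global__ void kernel(float* result, const float* a, const float* b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) result[i] = a[i] + b[i];   // placeholder computation
    }

    // h_* are page-locked host buffers, d_* device buffers, each N floats.
    void process(float* h_a0, float* h_b0, float* h_r0,
                 float* h_a1, float* h_b1, float* h_r1,
                 float* d_a0, float* d_b0, float* d_r0,
                 float* d_a1, float* d_b1, float* d_r1) {
        const size_t bytes = N * sizeof(float);
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Interleave the streams breadth-first so stream 1's copies can
        // overlap stream 0's kernel, matching the 41.1 ms version above.
        cudaMemcpyAsync(d_a0, h_a0, bytes, cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d_a1, h_a1, bytes, cudaMemcpyHostToDevice, s1);
        cudaMemcpyAsync(d_b0, h_b0, bytes, cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d_b1, h_b1, bytes, cudaMemcpyHostToDevice, s1);

        kernel<<<(N + 255) / 256, 256, 0, s0>>>(d_r0, d_a0, d_b0, N);
        cudaMemcpyAsync(h_r0, d_r0, bytes, cudaMemcpyDeviceToHost, s0);
        kernel<<<(N + 255) / 256, 256, 0, s1>>>(d_r1, d_a1, d_b1, N);
        cudaMemcpyAsync(h_r1, d_r1, bytes, cudaMemcpyDeviceToHost, s1);

        cudaStreamSynchronize(s0);   // wait for both streams to drain
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }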

CUDA Optimizations

- Streams are not always beneficial
  - The added overhead can result in a performance reduction
- Profile to compare kernel execution time against data transfer time
  - NVIDIA Visual Profiler
  - cudaEventRecord()
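A minimal sketch of cudaEventRecord()-based timing (kernel and sizes illustrative); the Visual Profiler gives a similar kernel-vs.-transfer breakdown without code changes:

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void kernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;        // placeholder work being timed
    }

    int main(void) {
        const int n = 1 << 20;
        float* d_data;
        cudaMalloc((void**)&d_data, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);         // enqueue marker before the work
        kernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop, 0);          // enqueue marker after the work
        cudaEventSynchronize(stop);        // block until the stop event fires

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // GPU-side elapsed milliseconds
        printf("kernel: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }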


GPU Results


[Chart: ALF processing speed (FPS) vs. optimization stage. Stage 0: 65.14; stage 1: 67.05; stage 2: 81.74; stage 3: 82.88; stage 4: 85.51.]

Optimization Stages

- 0: No optimizations (65 FPS)
- 1: Page-locked memory (67 FPS)
- 2: Asynchronous GPU calls (81 FPS)
- 3: Non-optimized streaming (82 FPS)
- 4: Optimized streaming (85 FPS)

GPU Results

- ALF CPU vs. GPU processing


Adaptive Linear Filter FPS vs. Image Height (4:3 aspect)

    Image Height | CPU FPS | GPU FPS
             624 |   77.64 |   92.78
             720 |   67.78 |   85.51
            1248 |   20.05 |   31.95
            1440 |   17.07 |   26.13
            1872 |    8.92 |   15.23
            2160 |    7.59 |   12.91
Conclusion

- Test various threads-per-block allocations
- Use page-locked memory for data transfers
- Use asynchronous memory transfers and non-blocking calls
- Ensure proper coordination of streams
- Exploit both data parallelism and task parallelism


QUESTIONS?


References

- Amdahl, G., “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities.” AFIPS Spring Joint Computer Conference, 1967.
- CUDA C Best Practices Guide, Version 4.0, May 2011.
- Gustafson, J., “Reevaluating Amdahl’s Law.” Communications of the ACM, Vol. 31, No. 5, May 1988.
- Sanders, J., and Kandrot, E., CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, 2011.
- NVIDIA CUDA Programming Guide, Version 4.0, May 6, 2011.
- TAU User Guide. Department of Computer and Information Science, University of Oregon Advanced Computing Laboratory, 2011.
