ENEL428 - Computer Software Engineering 2







Distributed Computing - CUDA














by Sam Leichter


01/10/2012




Introduction

Graphics Processing Units (GPUs) are highly parallel structures used in data processing. While they currently run at lower clock speeds than Central Processing Units (CPUs), GPUs use a single-instruction, multiple-data (SIMD) execution model, allowing many copies of the same instruction to be executed in parallel across a data set, leading to a shorter computation time. This report investigates the computation time of the 3D Fast Fourier Transform (FFT) and normalization for variable array sizes on the GPU using CUDA and on the CPU using FFTW. The resulting data will be utilised to make hardware design selections for the controller of the Van Den Broeck warp drive.
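As a concrete illustration of the two code paths being compared, a minimal sketch of the 3D transform call on each processing unit is given below. It assumes single-precision complex data and hypothetical dimensions NX, NY and NZ; the report's actual source is not reproduced here, so the exact calls may differ.

    #include <cufft.h>
    #include <fftw3.h>

    #define NX 128   /* hypothetical transform dimensions */
    #define NY 128
    #define NZ 128

    /* CPU path: in-place 3D complex-to-complex FFT with single-precision FFTW. */
    void cpu_fft(fftwf_complex *data)
    {
        fftwf_plan plan = fftwf_plan_dft_3d(NX, NY, NZ, data, data,
                                            FFTW_FORWARD, FFTW_ESTIMATE);
        fftwf_execute(plan);
        fftwf_destroy_plan(plan);
    }

    /* GPU path: the same transform with cuFFT, operating on device memory. */
    void gpu_fft(cufftComplex *d_data)
    {
        cufftHandle plan;
        cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cufftDestroy(plan);
    }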

Results and Discussion

Setup time, shown below in Fig. 1, is the time taken to allocate memory and initialize the data. In real use the data would not be created; it would be loaded from memory, having been collected earlier. The CPU setup time can clearly be seen to increase linearly with the length of the array to be transformed, whereas the GPU setup time remains constant. This is due to the parallel nature of the GPU, which can initialize the data in parallel, while the CPU has to use an unthreaded for loop to initialize the data. Of note is the spike in GPU processing time; this could be due to other load on the GPU, as graphics for the desktop PC still have to be rendered.
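The difference between the two initialization strategies can be sketched as follows. This is a minimal illustration assuming the data are filled element by element with placeholder values; the actual initialization used in the report is not shown in the text.

    __global__ void init_kernel(cufftComplex *d_data, int n)
    {
        /* One thread per element: the whole array is filled concurrently. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            d_data[i].x = (float)i;   /* placeholder value */
            d_data[i].y = 0.0f;
        }
    }

    void gpu_setup(cufftComplex **d_data, int n)
    {
        cudaMalloc((void **)d_data, n * sizeof(cufftComplex));
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        init_kernel<<<blocks, threads>>>(*d_data, n);
    }

    void cpu_setup(fftwf_complex **data, int n)
    {
        *data = (fftwf_complex *)fftwf_malloc(n * sizeof(fftwf_complex));
        for (int i = 0; i < n; i++) {     /* single-threaded loop, O(n) time */
            (*data)[i][0] = (float)i;     /* placeholder value */
            (*data)[i][1] = 0.0f;
        }
    }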


Figure 1. FFT initialization time. GPU scale on left, also in μs.

In Fig. 2 the comparison of the FFT and normalization time for both processing units is shown. Once again the CPU calculation time can clearly be seen to increase linearly with the length of the array being transformed. There is much more spread in the FFT operation because the CPU is not exclusively allocated to the FFT calculation, having to run the OS and other programs at the same time. This is also true for the GPU, but to a much lesser extent. The CPU is more than 10 times slower due to the parallel nature of the GPU, which executes the FFT and normalization in parallel. For both processing units, most of the total execution time is spent in this stage.
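The calculation stage measured here pairs the FFT itself with a normalization pass. A minimal sketch of how this stage might look on the GPU is given below; the scale factor of 1/N is an assumption (the conventional FFT normalization), since the report does not reproduce its source.

    __global__ void normalize_kernel(cufftComplex *d_data, int n, float scale)
    {
        /* One thread per element; every element is scaled concurrently. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            d_data[i].x *= scale;
            d_data[i].y *= scale;
        }
    }

    void gpu_calc(cufftHandle plan, cufftComplex *d_data, int n)
    {
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   /* 3D FFT in place */
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        normalize_kernel<<<blocks, threads>>>(d_data, n, 1.0f / n);
        cudaDeviceSynchronize();   /* ensure the kernel finishes before timing stops */
    }

On the CPU the equivalent stage would be fftwf_execute() followed by a single-threaded loop applying the same scale factor, which is why the CPU curve rises linearly with array length.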

[Chart for Figure 1 - Setup Time: execution time (μs) vs FFT length; series CPU SETUP and GPU SETUP.]



Figure 2. FFT computation and normalization time. GPU scale on left, also in μs.

In Fig. 3 the comparison of the teardown times for both processing units is shown. In this graph the CPU teardown time is constant while the GPU teardown time grows linearly with array size. This is because the CPU only has to free memory, while the GPU has to free memory and write all of the data back to the computer's main memory. The time taken to write the memory back is, however, still shorter than the time taken for the CPU to set up the memory. Of note, the CPU teardown time can be seen increasing linearly up to array lengths of 250000, where it flattens out. This would be due to how the memory is allocated in the first place: it is easier to create contiguous blocks of memory for smaller allocations, so fewer chunks of memory need to be freed in the freeing operation.
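A sketch of the two teardown paths, under the same assumptions as the earlier fragments, is given below; the growth of the GPU line in Fig. 3 corresponds to the device-to-host copy.

    void gpu_teardown(cufftComplex *d_data, cufftComplex *h_result, int n)
    {
        /* Copy the transformed data back to host memory, then release the
           device allocation; the copy is the part that grows with array size. */
        cudaMemcpy(h_result, d_data, n * sizeof(cufftComplex),
                   cudaMemcpyDeviceToHost);
        cudaFree(d_data);
    }

    void cpu_teardown(fftwf_complex *data)
    {
        /* The CPU result is already in main memory, so only the free remains. */
        fftwf_free(data);
    }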


Figure 3. FFT teardown time. GPU scale on left, also in μs.

The total time of the FFT and normalization program is dominated by the calculation stage on the CPU, due to its single-threaded nature. On the GPU, by comparison, the teardown takes slightly less than half of the execution time. The total execution time for the computation is shown below in Fig. 4.
[Chart for Figure 2 - FFT Calculation Time: execution time (μs) vs FFT length; series CPU CALC and GPU CALC.]
[Chart for Figure 3 - FFT Teardown Time: execution time (μs) vs FFT length; series CPU TEAR DWN and GPU TEAR DWN.]


It can be seen that at the low end of the array sizes the GPU and CPU curves cross, with the CPU taking less time to execute than the GPU.
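One way the per-stage times plotted in these figures could be collected (not necessarily how the report's program measures them) is with a wall-clock timer around each stage, synchronizing the GPU before each reading so that asynchronous kernel launches are not under-counted. The helper functions are the hypothetical ones from the earlier sketches.

    #include <stdio.h>
    #include <time.h>

    /* Wall-clock timestamp in microseconds. */
    static long long now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
    }

    void time_gpu_run(cufftHandle plan, cufftComplex *h_result, int n)
    {
        cufftComplex *d_data;

        long long t0 = now_us();
        gpu_setup(&d_data, n);               /* from the earlier sketch */
        cudaDeviceSynchronize();             /* wait for the init kernel */
        long long t1 = now_us();             /* setup time    = t1 - t0  */

        gpu_calc(plan, d_data, n);           /* synchronizes internally  */
        long long t2 = now_us();             /* calc time     = t2 - t1  */

        gpu_teardown(d_data, h_result, n);   /* blocking copy, then free */
        long long t3 = now_us();             /* teardown time = t3 - t2  */

        printf("setup %lld us, calc %lld us, teardown %lld us\n",
               t1 - t0, t2 - t1, t3 - t2);
    }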


Figure 4. Complete FFT calculation time. GPU scale on left, also in μs.

Because of the GPU's write-back operation shown in Fig. 3, the GPU is faster than the CPU in total execution time only once the length of the array exceeds 64. This can be seen in Fig. 5, where the first 3 data points for CPU total time are lower than the GPU time for the same array size.


Figure 5. Complete FFT calculation time for FFT lengths less than 1000.



[Chart for Figure 4 - FFT Total Time: execution time (μs) vs FFT length; series CPU TOTAL and GPU TOTAL.]
[Chart for Figure 5 - FFT Total Time (lengths less than 1000): execution time (μs) vs FFT length; series GPU TOTAL and CPU TOTAL.]


Conclusion

GPU-based processing of the FFT is substantially faster than the CPU for very large FFT lengths. However, when the array length is small (less than 64), the CPU is faster. This is because the GPU has to write the results back into the computer's main memory, a step the CPU does not need. The parallel nature of the GPU keeps its calculation time more than 10x faster than the CPU's, with the exception of the memory write-back.

If the FFTs used to control the warp drive are to be performed on large sets of data, a GPU-based setup is preferred as it takes less time.