Coordinating Multiple GPU Devices to Run MASS Applications


Piotr Warczak
CSS 596 / Spring 2012
Advisor: Munehiro Fukuda




Problem statement


This project is part of the Multi-Agent and Spatial Simulation (MASS) library, which is one of the three components of the Sensor Cloud Integration: An Agent-Based Workbench for On-the-Fly Sensor-Data Analysis project. This project explores CUDA, a parallel computing platform and programming model invented by NVIDIA, to parallelize and thus accelerate a simulation of Schroedinger's wave dissemination over a two-dimensional space, also known as Wave2D. The project demonstrates how the Wave2D program can be mapped onto single or multiple GPU devices and compares two different methods of coordinating multiple devices. The Wave2D program was selected as a test program because it is one of the Multi-Agent and Spatial Simulation (MASS) applications used to test the MASS library. Although this project is tailored towards Wave2D, it can be extended or used as the base for a MASS-GPU library.


Reason for choosing this topic


The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future.

Based on my initial readings on GPU computing and the current uses of this technology, I became extremely interested in this topic and wanted to confirm first-hand the performance and capabilities of GPU devices. In my project I explored the CUDA platform, which enables dramatic increases in computing performance by harnessing the power of the GPU. In order to evaluate performance increases, I implemented the simulation of Schroedinger's wave dissemination over a two-dimensional space (Wave2D) in single threaded, multithreaded, single GPU, and multiple GPU versions using the C language. Having all of the applications developed in a single programming language allowed me to compare pure parallelization techniques. Furthermore, having the single threaded and multithreaded versions of the Wave2D program allowed me to generate a baseline for the performance evaluation and data validation.


Literature review


The multi-process library for multi-agent and spatial simulation (MASS) is intended to facilitate entity-based simulation for on-the-fly sensor data analysis. It reduces the difficulty which comes with mapping designers' algorithms to technology-specific implementations such as OpenMP, MPI, and MapReduce. Without it, each designer has to learn and apply the technology-specific implementation rather than spend their time on the actual problem they are trying to solve. MASS reduces the complexity of creating an application and mapping its algorithms to a specific technology, and it allows designers to focus on their programs' core functionality and correctness by abstracting the multi-core and multi-process implementation details.

The MASS library demonstrates programming advantages such as a clear separation of the simulation scenario from the simulation model and automatic parallelization. The MASS library would benefit from a parallel computing platform and programming model such as CUDA.

The other papers reviewed suggest that utilizing multiple GPUs can speed up programs from tens to hundreds of times, depending on the program and on how the workload can be divided between GPUs. Memory management needs to be carefully considered to fully maximize GPU power. They also mention that the hardware needs to be carefully considered when designing a system with multiple GPU devices.


Wave2D Overview


As mentioned above, the Wave2D program is a simulation of Schroedinger's wave dissemination over a two-dimensional space. The two-dimensional space is partitioned into N by N cells. A wave is disseminated north, south, east and west of each cell, and thus each cell needs to compute its new surface height from the previous height of itself and its four neighboring cells. The water surface can go up and down between 20.0 and -20.0 through the wave dissemination.

At time t = 0, the water surface needs to be initialized; thus, a huge tidal wave is generated in the middle of this water surface. All cells (i, j) are set to 0.0, with the exception of the cells (i, j) where 0.4 * N < i < 0.6 * N and 0.4 * N < j < 0.6 * N.





At time t = 1, Schroedinger's wave formula computes the following:

Zt_i,j = Zt-1_i,j + c^2 / 2 * (dt/dd)^2 * (Zt-1_i+1,j + Zt-1_i-1,j + Zt-1_i,j+1 + Zt-1_i,j-1 - 4.0 * Zt-1_i,j)

where

- c is the wave speed (set to 1.0)
- dt is a time quantum for the simulation (set to 0.1)
- dd is a change of the surface (set to 2.0) for all experiments


At time t >= 2, the formula is the following:

Zt_i,j = 2.0 * Zt-1_i,j - Zt-2_i,j + c^2 * (dt/dd)^2 * (Zt-1_i+1,j + Zt-1_i-1,j + Zt-1_i,j+1 + Zt-1_i,j-1 - 4.0 * Zt-1_i,j)


The simulation is initialized at t = 0; the time t is then incremented by one at each step, and the surface height of all cells (i, j), i.e., Zt_i,j, is computed at each time t based on the above formulas.
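As a concrete illustration, the core of a straightforward CPU implementation of one simulation step might look like the sketch below. The array and variable names (z, z_prev, z_prev2) and the boundary handling are assumptions for illustration, not the project's actual code:

```c
/* Sketch of one Wave2D time step on the CPU (illustrative names, not the
 * project's exact code).  z_prev2, z_prev and z are N*N arrays holding the
 * surface heights at times t-2, t-1 and t; only interior cells are updated,
 * and boundary cells are assumed to stay at 0.0. */
void wave2d_step(double *z, const double *z_prev, const double *z_prev2,
                 int N, double c, double dt, double dd) {
    double k = c * c * (dt / dd) * (dt / dd);   /* c^2 * (dt/dd)^2 */
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            double neighbors = z_prev[(i + 1) * N + j] + z_prev[(i - 1) * N + j]
                             + z_prev[i * N + (j + 1)] + z_prev[i * N + (j - 1)]
                             - 4.0 * z_prev[i * N + j];
            /* Formula for t >= 2; at t == 1 the first term is just z_prev,
             * the z_prev2 term is dropped, and k is halved. */
            z[i * N + j] = 2.0 * z_prev[i * N + j] - z_prev2[i * N + j]
                         + k * neighbors;
        }
    }
}
```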


Here are a few snapshots of the Wave2D simulation as it changes over time:

[Figure: snapshots of the Wave2D simulation at successive time steps]

GPU and CUDA Overview


The GPU, a graphics processing unit, is a massively multithreaded multicore chip available, for example, in:

- computer video cards
- PlayStation 3
- XBOX


CUDA, the Compute Unified Device Architecture, is a scalable parallel programming model and a software environment for parallel computing. Its heterogeneous serial-parallel programming model provides minimal extensions to the familiar C and C++ environments. The combination of GPU computing with CUDA brings data-parallel computing to the masses, as more than 46 million CUDA-capable GPUs have already been sold and a developer kit costs on average $200. As a result, massively parallel computing has become a commodity technology. Computing problems which used to require an incredible amount of computing resources can today be solved on a laptop equipped with a CUDA-enabled GPU card.


CUDA provides a heterogeneous serial-parallel programming model in which the CPU and GPU are separate devices with separate DRAMs. Serial code executes in a host (CPU) thread, and parallel kernel code executes in device threads across multiple processing elements (GPU threads). One kernel is executed at a time on the device, and multiple threads execute each kernel. Each thread executes the same code on different data based on the CUDA built-in threadIdx object.





CUDA Threads


A grid is composed of blocks, which are completely independent, and a block is composed of threads, which can communicate within their own block.


CUDA Programming


A typical approach to processing data with CUDA is to allocate memory on the CPU and copy it to the GPU memory. Once the memory is copied, we can start executing CUDA code by launching kernel methods. After successful execution of the kernel methods, the results need to be copied from the GPU device back to system (CPU) memory.
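A minimal host-side sketch of this allocate / copy / launch / copy-back pattern is shown below. The kernel name, the array size, and the placeholder computation are assumptions for illustration, not the project's actual code:

```c
// Minimal sketch of the typical CUDA host workflow (illustrative only).
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void kernelMethod(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= 2.0f;                     // placeholder computation
}

int main(void) {
    int n = 10000;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);    // host (CPU) memory
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, bytes);                // device (GPU) memory

    // 1. Copy data from CPU memory to GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 2. Run the CUDA kernel (100 blocks of 100 threads)
    kernelMethod<<<100, 100>>>(d_data, n);

    // 3. Copy the results from GPU memory back to CPU memory
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```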





Single Threaded


In a 100 by 100 simulation space, a single threaded version needs to compute one of the above formulas for each cell, which means performing the calculation 10,000 times per simulation time increment. As you can imagine, this consumes server resources and takes much more time to complete the simulation.


Multithreaded


In the multithreaded version we can take advantage of multiple threads if the server consists of multiple cores. If four threads are used, as in my experiment, the workload of 10,000 cells is divided equally into 2,500 cells per thread. Consequently, the total execution time is smaller and the server's resources become available to process other tasks much sooner than in the single threaded version.

[Figure: CUDA data flow - copy data from CPU memory to GPU memory, run CUDA kernels, copy data from GPU memory back to CPU memory. Typical bandwidths: CPU memory 5-10 GB/s, GPU memory 50-80 GB/s, PCIe 5 GB/s.]


This version uses pthreads and barrier synchronization. To compile the code please use the following:

gcc -pthread Wave2DThread.c -o Wave2DThread


The multithreaded version uses pthreads and the following synchronization methods:

- pthread_barrier_init: initializes the barrier with the specified number of threads
- pthread_create: spawns each thread and specifies the function to be executed
- pthread_join: waits for all the threads to finish their assigned work


And the barrier wait is implemented like this:

// Synchronization point
int rc = pthread_barrier_wait(&barr);
if (rc != 0 && rc != PTHREAD_BARRIER_SERIAL_THREAD) {
    printf("Could not wait on barrier\n");
    exit(-1);
}
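
For context, a condensed sketch of how the four worker threads might be created, synchronized with the barrier, and joined is shown below. The function and variable names, the number of time steps, and the row split are assumptions for illustration, not the exact contents of Wave2DThread.c:

```c
/* Illustrative sketch of the pthread structure (not the exact project code). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 4

pthread_barrier_t barr;

void *compute(void *arg) {
    int tid = *(int *)arg;
    (void)tid;  /* in the real code, tid selects this thread's 2,500-cell slice */
    for (int t = 1; t < 100; t++) {
        /* ... update this thread's share of the cells for time step t ... */
        int rc = pthread_barrier_wait(&barr);   /* wait for all threads */
        if (rc != 0 && rc != PTHREAD_BARRIER_SERIAL_THREAD) {
            printf("Could not wait on barrier\n");
            exit(-1);
        }
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    pthread_barrier_init(&barr, NULL, NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, compute, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&barr);
    return 0;
}
```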



Single GPU


With a single GPU, however, each cell can be calculated by an independent, lightweight CUDA thread. The CUDA API allows invoking hundreds or thousands of threads simultaneously by calling a kernel method:

kernelMethod<<<Blocks, Threads>>>

kernelMethod<<<100, 100>>>

This special notation tells the GPU device to create 100 blocks, and each block will start 100 threads. It will thus execute 100 by 100 threads (10,000 threads) in parallel, demonstrating the performance gain and the power behind GPU computing.
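
A sketch of what such a kernel could look like for Wave2D is shown below; one thread updates one cell, with the block and thread indices mapped to (i, j). The kernel name, parameter list, and index mapping are assumptions for illustration, not the project's actual kernel:

```c
// Illustrative Wave2D kernel sketch (t >= 2 formula); not the project's exact code.
__global__ void wave2dKernel(double *z, const double *z_prev,
                             const double *z_prev2, int N, double k) {
    int i = blockIdx.x;   // one block per row     (100 blocks)
    int j = threadIdx.x;  // one thread per column (100 threads per block)

    if (i > 0 && i < N - 1 && j > 0 && j < N - 1) {
        double neighbors = z_prev[(i + 1) * N + j] + z_prev[(i - 1) * N + j]
                         + z_prev[i * N + (j + 1)] + z_prev[i * N + (j - 1)]
                         - 4.0 * z_prev[i * N + j];
        z[i * N + j] = 2.0 * z_prev[i * N + j] - z_prev2[i * N + j]
                     + k * neighbors;              // k = c^2 * (dt/dd)^2
    }
}

// Launched once per time step, e.g.:
//   wave2dKernel<<<100, 100>>>(d_z, d_z_prev, d_z_prev2, 100, k);
```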



Single GPU Results


[Figure: a single device running 100 blocks of 100 threads each]
Simulation Space   Simulation Time   4 CPU threads (secs)   CUDA (secs)   Improvement (%)
 100               100                0.012                  0.003           443%
 300               100                0.055                  0.011           501%
 500               100                0.149                  0.067           224%
1000               100                0.856                  0.078          1099%
1500               100                2.050                  0.156          1313%
2000               100                6.128                  0.351          1748%
3000               100               10.341                  0.721          1434%
4000               100               24.510                  1.218          2012%
5000               100               31.442                  1.853          1696%






[Charts: "Multithreaded vs CUDA" - total execution time in seconds versus simulation space N by N, for spaces from 100 x 100 to 5000 x 5000, with a zoomed-in view of the 100 x 100 to 1000 x 1000 range.]

The charts above illustrate that the Wave2D CUDA version outperforms the multithreaded version in every case. From the smallest simulation space of 100 by 100 to the largest of 5000 by 5000, the difference in total execution time between the two versions grows as the simulation space becomes larger and the number of computations increases.


Multiple GPU devices


To coordinate multiple GPU devices, we need to create a separate host thread per device. The workload is then divided equally between these devices. The data is staged on the host (CPU), and each host thread calculates its portion of the array based on a data index offset. Each host thread must also allocate its own memory, which needs to be copied to the GPU. Once the GPU completes its assigned work, the host thread needs to update the main arrays staged on the host and also pick up the portion completed by the other thread(s). As you can see, this involves a lot of communication between CPU and GPU. There is also a large overhead of allocating redundant data in the main (staged) arrays, the host thread arrays, and the GPU arrays, plus the additional step of synchronization between the staged data and each host thread.
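
A condensed sketch of this copy-based coordination, with one host thread per device selecting its GPU and working on its share of the staged arrays, might look like the following. The names, the row-split scheme, and the offset-carrying kernel are assumptions for illustration, not the project's actual code:

```c
// Illustrative copy-based multi-GPU sketch: one host thread per device.
#include <cuda_runtime.h>

// Same stencil as the single-GPU kernel, plus a global row offset so each
// device computes only its own rows of the full N x N space.
__global__ void wave2dKernelOffset(double *z, const double *z_prev,
                                   const double *z_prev2, int N,
                                   double k, int rowOffset) {
    int i = rowOffset + blockIdx.x;
    int j = threadIdx.x;
    if (i > 0 && i < N - 1 && j > 0 && j < N - 1) {
        double nb = z_prev[(i + 1) * N + j] + z_prev[(i - 1) * N + j]
                  + z_prev[i * N + (j + 1)] + z_prev[i * N + (j - 1)]
                  - 4.0 * z_prev[i * N + j];
        z[i * N + j] = 2.0 * z_prev[i * N + j] - z_prev2[i * N + j] + k * nb;
    }
}

typedef struct {
    int device, rowStart, rowCount, N;
    double k;
    double *h_z, *h_z_prev, *h_z_prev2;   // arrays staged on the host (full N x N)
} WorkerArgs;

// Executed by each host thread (e.g. started with pthread_create), one per GPU.
void *deviceWorker(void *p) {
    WorkerArgs *a = (WorkerArgs *)p;
    cudaSetDevice(a->device);              // bind this host thread to its device

    size_t full = (size_t)a->N * a->N * sizeof(double);
    double *d_z, *d_prev, *d_prev2;
    cudaMalloc(&d_z, full);
    cudaMalloc(&d_prev, full);
    cudaMalloc(&d_prev2, full);

    // Redundantly copy the staged arrays to this device.
    cudaMemcpy(d_prev,  a->h_z_prev,  full, cudaMemcpyHostToDevice);
    cudaMemcpy(d_prev2, a->h_z_prev2, full, cudaMemcpyHostToDevice);

    // Compute only this device's rows, addressed through the global offset.
    wave2dKernelOffset<<<a->rowCount, a->N>>>(d_z, d_prev, d_prev2,
                                              a->N, a->k, a->rowStart);

    // Copy this device's portion back into the staged host array; the host
    // then merges both portions before the next time step.
    cudaMemcpy(a->h_z + (size_t)a->rowStart * a->N,
               d_z   + (size_t)a->rowStart * a->N,
               (size_t)a->rowCount * a->N * sizeof(double),
               cudaMemcpyDeviceToHost);

    cudaFree(d_z); cudaFree(d_prev); cudaFree(d_prev2);
    return NULL;
}
```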


There is another method to achieve the same result: stage the array in system memory as in the previous example, but pass the host memory pointer to the GPU. In this case the GPU reads and writes the data on demand, and as long as each element is read or written only once in the kernel method, the communication overhead can largely be hidden.





The two methods to coordinate multiple GPU devices which I have implemented and tested are the following (a minimal zero-copy setup sketch follows the list):

1. Copy-based
   a. Redundantly copies all data and indexes it using global offsets
   b. Requires data to be copied back and forth between the CPU (host) and the multiple GPU devices

2. Zero-copy approach
   a. Retrieves data on demand from the host and uses the index as a global offset
   b. Requires data to be allocated only once on the CPU (host); each GPU device then gets a pointer to the host memory
   c. The device needs to support the cudaDeviceMapHost flag
   d. Requires the memory to be pinned
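
A minimal sketch of the zero-copy setup is shown below, assuming the device reports canMapHostMemory; the function name, array name, and size are placeholders, not the project's actual code:

```c
// Illustrative zero-copy setup (names and sizes are placeholders).
#include <cuda_runtime.h>

int setupZeroCopy(int device, size_t n, float **h_data, float **d_data) {
    cudaSetDevice(device);

    // The device must be able to map host memory into its address space.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    if (!prop.canMapHostMemory)
        return -1;

    // Enable host-memory mapping before allocating the mapped buffer.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate pinned (page-locked) host memory that is mapped to the device.
    cudaHostAlloc((void **)h_data, n * sizeof(float), cudaHostAllocMapped);

    // Obtain the device-side pointer to the same memory; the kernel then
    // reads and writes the host array on demand over PCIe.
    cudaHostGetDevicePointer((void **)d_data, *h_data, 0);
    return 0;
}
```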


Tests were run on both servers available in the lab: hydra.uwb.edu and hercules.uwb.edu.

[Figure: the simulation space split across two devices, each running 50 blocks of 100 threads]



HYDRA Configuration

- 2 x Tesla C1060: 1.3 CC, 240 CUDA cores, 4 GB

HYDRA Results

Simulation Space   Single GPU (secs)   Multi GPU Copy-based (secs)   Multi GPU Zero-copy (secs)
100                0.002               4.389                         0.204
300                0.011               4.551                         0.885
500                0.019               4.769                         2.217


HERCULES Configuration

- GeForce GTX 680: 3.0 CC, 1536 CUDA cores, 2 GB
- Quadro NVS 295: 1.1 CC, 8 CUDA cores, 256 MB

HERCULES Results

Simulation Space   Single GPU (secs)   Multi GPU Copy-based (secs)   Multi GPU Zero-copy (secs)
100                0.004               0.750                         N/A
300                0.007               0.949                         N/A
500                0.015               1.161                         N/A



Based on my tests, I found that neither method improved the performance of the simulation. In fact, the redundant memory allocation, as well as the additional communication and synchronization of data between the host and the devices, slowed the performance significantly. Also, there is a large discrepancy between the two servers in the Multi GPU Copy-based test. On Hydra each simulation takes at least 4 seconds, but subsequent tests with a larger simulation space do not change as much. In fact, the change between different simulation spaces is quite similar, which makes me think that there is a difference in hardware configuration, such as a slower PCIe bus.

Another finding was that I was not able to run the Zero-copy test on the Hercules server, as the Quadro NVS 295 would fail when passing a host memory reference to the device.




Lessons Learned


Based on this project, I have learned about CUDA technology and the GPU computing market. My knowledge of memory management has expanded dramatically. I have learned what is needed to go through the process of implementing a single threaded and a multithreaded application and then mapping it to a GPU version with a single GPU or multiple GPUs. The coordination of multiple devices was challenging at times, but in the end I have learned a new skill which, hopefully, I will be able to apply at my work.



Next Steps


My next immediate step would be to test my solution on two identical GPU devices with compute capability > 2.0. These cards support the Unified Virtual Addressing (UVA) feature, which allows data to be copied directly between GPU cards, thus removing the overhead of redundant copies of data between host and devices. I believe that these new results would show a performance improvement when coordinating multiple GPU devices.
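
With UVA-capable devices, peer-to-peer transfers could be enabled roughly as sketched below. This is a minimal sketch of the CUDA peer-access API under that assumption, not code from this project:

```c
// Illustrative sketch of enabling direct GPU-to-GPU copies with UVA/P2P.
#include <cuda_runtime.h>

int enablePeerCopy(int dev0, int dev1) {
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, dev0, dev1);
    cudaDeviceCanAccessPeer(&canAccess10, dev1, dev0);
    if (!canAccess01 || !canAccess10)
        return -1;                        // P2P not supported between these devices

    cudaSetDevice(dev0);
    cudaDeviceEnablePeerAccess(dev1, 0);  // dev0 may now access dev1's memory

    cudaSetDevice(dev1);
    cudaDeviceEnablePeerAccess(dev0, 0);
    return 0;
}

// Once enabled, boundary rows could be exchanged directly between devices:
//   cudaMemcpyPeer(dstPtrOnDev1, 1, srcPtrOnDev0, 0, bytes);
```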


The next phase would be to generalize my solution in such a way that it could be used by other programs, such as Molecular Dynamics, which could benefit from powerful and efficient GPU parallelization.




References


1. NVIDIA CUDA C Programming Guide. © 2006-2011 NVIDIA Corporation. All rights reserved. This work incorporates portions of an earlier work: Scalable Parallel Programming with CUDA, in ACM Queue, Vol. 6, No. 2 (March/April 2008), © ACM, 2008.

2. Emau, J., Chuang, T., & Fukuda, M. (2011). A multi-process library for multi-agent and spatial simulation. Paper presented at the Communications, Computers and Signal Processing (PacRim), 2011 IEEE Pacific Rim Conference on, 369-375.

3. Guilde, R., Weeks, M., Owen, S., & Pan, Y. (2004). Parallel Computing with Multiple GPUs on a Single Machine to Achieve Performance Gains. Presented at SIGGRAPH '04 ACM SIGGRAPH 2004 Posters.

4. Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi, and Wen-mei W. Hwu. CUDA-Lite: Reducing GPU Programming Complexity. LCPC 2008, LNCS 5335, pp. 1-15, 2008.

5. Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization. Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

6. Biddiscombe, J. (ETH Zürich), Le Touzé, D. (Ecole Centrale de Nantes), Leboeuf, F. (Ecole Centrale de Lyon), Marongiu, J.C. (ANDRITZ Hydro), Oger, G. (HydrOcean), Soumagne, J. (ETH Zürich). Efficient parallelization strategy for the SPH method. NextMuSE (Next generation Multi-mechanics Simulation Environment), a Future and Emerging Technologies FP7-ICT European project.

7. Schaa, D., & Kaeli, D. Exploring the Multiple-GPU Design Space. Dept. of Electr. & Comput. Eng., Northeastern Univ., Boston, MA, USA. Parallel & Distributed Processing, 2009 (IPDPS 2009), IEEE International Symposium on, 23-29 May 2009, 1-12.

8. Sensor Cloud Integration: An Agent-Based Workbench for On-the-Fly Sensor-Data Analysis project. http://depts.washington.edu/dslab/SensorCloud/index.html

9. Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional; 1st edition (July 29, 2010).

10. David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series). Morgan Kaufmann; 1st edition (February 5, 2010).

11. CUDA Developer Zone. http://developer.nvidia.com/