N-Body Simulation


Kenneth Owens


We wish to compute the interactions between particles (bodies) given their masses and positions.


The simulation is performed in time steps:


Forces between all bodies are computed, O(n²)


Positions of all bodies are updated based on their current kinematics and their interactions with the other bodies, O(n)


Time moves forward by one step
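Putting these steps together, a minimal sketch of the outer time-step loop, assuming the nbody() step function shown later (num_steps and the buffer names are stand-ins, not the author's exact driver):

    /* Sketch: one run of the simulation. Each step does the O(n^2)
     * force computation followed by the O(n) kinematics update. */
    for (size_t step = 0; step < num_steps; step++) {
        nbody(positions, velocities, current_positions, current_velocities,
              accel, num_bodies, dt, damping, softening_squared);
    }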



The force on a body i from N other bodies is approximated by computing the interactions given their masses (m), the distance vectors between them (r_ij), and a softening factor (ε); the formula is reconstructed below.


This is computed for all bodies with all other
bodies
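The slide's equation image did not survive extraction; reconstructed from the GPU Gems 3 chapter this code follows (the interaction() routine below folds G into the masses):

$$\mathbf{a}_i \approx G \sum_{1 \le j \le N} \frac{m_j\,\mathbf{r}_{ij}}{\left(\lVert\mathbf{r}_{ij}\rVert^2 + \epsilon^2\right)^{3/2}}$$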


Euler method: for each particle, a discrete time step (dt) is used to approximate the continuous kinematic equations and update the position and velocity of each particle:

$$\frac{d\mathbf{v}}{dt} = \mathbf{a} \;\Rightarrow\; d\mathbf{v} = \mathbf{a}\,dt \;\Rightarrow\; \mathbf{v}_{i+1} = \mathbf{v}_i + \mathbf{a}_i\,dt$$

$$\frac{d\mathbf{x}}{dt} = \mathbf{v} \;\Rightarrow\; d\mathbf{x} = \mathbf{v}\,dt \;\Rightarrow\; \mathbf{x}_{i+1} = \mathbf{x}_i + \mathbf{v}_i\,dt$$

Execute an n-body simulation on a distributed-memory architecture with multiple GPUs on each node


Sequential implementation of the n-body simulation code


Written in C


Compiled using gcc-4.4 with -O3

MPI implementation


Written in C


Compiled using mpicc.mpich with gcc-4.4 using -O3


Executed using mpirun.mpich on 2, 5, and 10 nodes
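For example, a 10-process run might look like the following (the binary name is a placeholder, and host-selection flags vary by MPICH version):

    mpirun.mpich -np 10 ./nbody_mpi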


GPU implementation


Written in C with CUDA extensions


Compiled using nvcc with gcc-4.4 using -O3


Executed on Nvidia 580s


MPI-GPU implementation


The MPI driver above was combined with the GPU kernel implementation


Compiled but not tested for correctness


The main method of the driver calls nbody


nbody calls two externally linked functions


compute_forces computes the interactions


update_positions updates the kinematics

void nbody(vector4d_t* positions, vector4d_t* velocities,
           vector4d_t* current_positions, vector4d_t* current_velocities,
           vector3d_t* accel, size_t size,
           value_t dt, value_t damping, value_t softening_squared)
{
    compute_forces(positions, accel, size, positions, size,
                   softening_squared);
    update_positions(positions, velocities, current_positions,
                     current_velocities, accel, size, dt, damping);
}
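The vector types are never shown on the slides; a minimal sketch consistent with how the code uses them (value_t as float is an assumption; the w component carries the mass, as interaction() below shows):

    typedef float value_t;                             /* assumed scalar type */
    typedef struct { value_t x, y, z; }    vector3d_t; /* force/acceleration  */
    typedef struct { value_t x, y, z, w; } vector4d_t; /* position + mass (w) */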


Computes the pair-wise interactions


The second loop over source bodies is hidden inside the acceleration function (a sketch of it follows the code below)

void compute_forces(vector4d_t* positions, vector3d_t* forces,
                    size_t positions_size, vector4d_t* sources,
                    size_t sources_size, value_t softening_squared)
{
    for (size_t i = 0; i < positions_size; i++)
    {
        forces[i] = acceleration(positions[i], sources, sources_size,
                                 forces[i], softening_squared);
    }
}
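The CPU acceleration() helper is referenced but never shown; a minimal sketch of the hidden second loop, assuming it simply folds each source body into the running total via interaction() (shown next):

    vector3d_t acceleration(vector4d_t body, vector4d_t* sources,
                            size_t sources_size, vector3d_t acc,
                            value_t softening_squared)
    {
        /* Hidden second loop: accumulate the contribution of every source. */
        for (size_t j = 0; j < sources_size; j++)
            acc = interaction(acc, body, sources[j], softening_squared);
        return acc;
    }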


The computation for an individual interaction, written in C

vector3d_t interaction(vector3d_t acceleration, vector4d_t body1,
                       vector4d_t body2, value_t softening_squared)
{
    vector3d_t force;

    force.x = body1.x - body2.x;
    force.y = body1.y - body2.y;
    force.z = body1.z - body2.z;

    float distSqr = force.x * force.x + force.y * force.y
                  + force.z * force.z;
    distSqr += softening_squared;

    float invDist = 1.0f / sqrt(distSqr);
    float invDistCube = invDist * invDist * invDist;

    float s = body2.w * invDistCube;

    acceleration.x += force.x * s;
    acceleration.y += force.y * s;
    acceleration.z += force.z * s;

    return acceleration;
}


Updates each position based on the
computed forces

void update_positions(vector4d_t* positions,
                      vector4d_t* velocities,
                      vector4d_t* current_positions,
                      vector4d_t* current_velocities,
                      vector3d_t* acceleration,
                      size_t size,
                      value_t dt, value_t damping)
{
    for (size_t i = 0; i < size; i++)
    {
        vector4d_t current_position = current_positions[i];
        vector3d_t accel = acceleration[i];
        vector4d_t current_velocity = current_velocities[i];
        update_position(&positions[i], &velocities[i],
                        current_position, current_velocity,
                        accel, dt, damping);
    }
}


Implements the previously shown equations

void update_position(vector4d_t* position, vector4d_t* velocity,
                     vector4d_t current_position,
                     vector4d_t current_velocity,
                     vector3d_t acceleration,
                     value_t dt, value_t damping)
{
    current_velocity.x += acceleration.x * dt;
    current_velocity.y += acceleration.y * dt;
    current_velocity.z += acceleration.z * dt;

    current_velocity.x *= damping;
    current_velocity.y *= damping;
    current_velocity.z *= damping;

    current_position.x += current_velocity.x * dt;
    current_position.y += current_velocity.y * dt;
    current_position.z += current_velocity.z * dt;

    *position = current_position;
    *velocity = current_velocity;
}


Started with the implementation from GPU Gems: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html


Modified the code to work with data sizes
that are larger than 256 but that are not
evenly divisible by 256


Added kinematics update


Code no longer works for sizes less than 256


Needed command line specification to control grid
and block size anyway
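A plausible launch configuration implied by these notes (a sketch, not the author's code; num_blocks coming from argv is an assumption): 256 threads per block, shared memory sized for 256 positions, and the partial last block handled by the kernel's grid stride:

    dim3 threads(256, 1, 1);                 /* 256 positions per block   */
    dim3 grid(num_blocks, 1, 1);             /* taken from command line   */
    size_t sharedMemSize = threads.x * sizeof(float4);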


Copies the data to device memory and executes the compute_forces_gpu kernel


Note: the device allocation (cudaMalloc) calls are truncated to fit the code on the slide

void compute_forces(vector4d_t* positions, vector3d_t* forces,
                    size_t positions_size, vector4d_t* sources,
                    size_t sources_size, value_t softening_squared)
{
    …..  /* device allocation and host-to-device copies truncated */

    compute_forces_gpu<<< grid, threads, sharedMemSize >>>(
        device_positions, device_forces, positions_size,
        device_sources, sources_size, softening_squared);
    cudaThreadSynchronize();
    cudaMemcpy(forces, device_forces, positions_size * sizeof(float3),
               cudaMemcpyDeviceToHost);

    cudaFree(device_positions);
    cudaFree(device_sources);
    cudaFree(device_forces);

    err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s\n", cudaGetErrorString(err));
    }
}
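The elided preamble is not in the slides; a generic sketch of what such allocation and transfer code typically looks like with the standard CUDA API (variable names match the calls above, but this is an assumption, not the author's code):

    /* Hypothetical reconstruction of the truncated setup. */
    vector4d_t* device_positions; vector4d_t* device_sources;
    vector3d_t* device_forces;
    cudaMalloc((void**)&device_positions, positions_size * sizeof(vector4d_t));
    cudaMalloc((void**)&device_sources,   sources_size   * sizeof(vector4d_t));
    cudaMalloc((void**)&device_forces,    positions_size * sizeof(vector3d_t));
    cudaMemcpy(device_positions, positions,
               positions_size * sizeof(vector4d_t), cudaMemcpyHostToDevice);
    cudaMemcpy(device_sources, sources,
               sources_size * sizeof(vector4d_t), cudaMemcpyHostToDevice);
    cudaMemcpy(device_forces, forces,
               positions_size * sizeof(vector3d_t), cudaMemcpyHostToDevice);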





Each thread computes the acceleration for its position and then strides to the next block


For our test sizes this grid stride only served as cleanup for sizes not divisible by 256

__global__ void compute_forces_gpu(float4* positions, float3* forces,
                                   int size, float4* sources,
                                   int sources_size,
                                   float softening_squared)
{
    for (int index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
         index < size;
         index += blockDim.x * gridDim.x)
    {
        float4 pos = positions[index];
        forces[index] = acceleration(pos, sources, sources_size,
                                     forces[index], softening_squared);
    }
}

Uses float3 and float4 instead of home-brewed vector types


Shared memory holds 256 positions per block


Each thread strides across the grid to update a single particle

__device__ float3 acceleration(float4 position, float4* positions,
                               int size, float3 acc,
                               float softening_squared)
{
    extern __shared__ float4 sharedPos[];

    int p = blockDim.x;
    int q = blockDim.y;
    int n = size;
    int numTiles = n / (p * q);

    for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
    {
        sharedPos[threadIdx.x + blockDim.x * threadIdx.y] =
            positions[WRAP(blockIdx.x + tile, gridDim.x) * p + threadIdx.x];

        __syncthreads();

        // This is the "tile_calculation" function from the GPUG3 article.
        acc = gravitation(position, acc, softening_squared);

        __syncthreads();
    }

    return acc;
}
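WRAP and gravitation() are not defined on the slides. WRAP is presumably the modular-wrap macro from the GPU Gems 3 sample, reproduced here as an assumption; gravitation() would iterate over sharedPos, accumulating interactions the same way the CPU interaction() routine does:

    /* Assumed to match the GPU Gems 3 sample: wrap x into [0, m). */
    #define WRAP(x, m) (((x) < (m)) ? (x) : ((x) - (m)))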


The kernel strides in the same way as the force computation


Each thread updates a single position; all threads run simultaneously

__global__ void update_positions_gpu(float4* positions, float4* velocities,
                                     float4* current_positions,
                                     float4* current_velocities,
                                     float3* forces, int size,
                                     float dt, float damping)
{
    for (int index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
         index < size;
         index += blockDim.x * gridDim.x)
    {
        float4 pos = current_positions[index];
        float3 accel = forces[index];
        float4 vel = current_velocities[index];

        vel.x += accel.x * dt;
        vel.y += accel.y * dt;
        vel.z += accel.z * dt;

        vel.x *= damping;
        vel.y *= damping;
        vel.z *= damping;

        // new position = old position + velocity * deltaTime
        pos.x += vel.x * dt;
        pos.y += vel.y * dt;
        pos.z += vel.z * dt;

        // store new position and velocity
        positions[index] = pos;
        velocities[index] = vel;
    }
}


O(n²)/p pipeline implementation


Particles are divided among processes


Particle positions are shared in a ring
communication topology


Force computation occurs for all particles by
sending the data around the ring


After all forces are computed, each process updates the kinematics of its own particles


Compiles with CPU and GPU implementations


Timings have only been collected for CPU

for (size_t i = 0; i < time_steps; i++)
{
    memcpy(sendbuf, current_positions,
           num_particles * sizeof(vector4d_t));

    for (pipe = 0; pipe < size; pipe++)
    {
        if (pipe != size - 1)
        {
            MPI_Isend(sendbuf, num_particles, mpi_vector4d_t, right, pipe,
                      commring, &request[0]);
            MPI_Irecv(recvbuf, num_particles, mpi_vector4d_t, left, pipe,
                      commring, &request[1]);
        }

        compute_forces(positions, accel, num_particles, positions,
                       num_particles, softening_squared);

        if (pipe != size - 1)
            MPI_Waitall(2, request, statuses);

        memcpy(sendbuf, recvbuf, num_particles * sizeof(vector4d_t));
    }

    update_positions(positions, velocities, current_positions,
                     current_velocities, accel, num_particles, dt, damping);
}
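The ring setup (left, right, commring) is not shown; a minimal sketch under the assumption that plain modular rank arithmetic is used (a cartesian communicator would work equally well):

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;          /* send particle blocks this way */
    int left  = (rank - 1 + size) % size;   /* receive from the other side   */
    MPI_Comm commring = MPI_COMM_WORLD;     /* stand-in for the ring comm    */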


Timings for the sequential and GPU versions were taken on float


Timings for MPI were taken on tux


All runs used 10 time steps


Wall-clock time was collected for comparison


Memory allocation time was omitted


Except for device memory allocation and device
data transfer


Timings were not collected for the code using MPI to distribute data over multiple nodes with multiple GPUs


[Charts omitted: sequential wall-clock time and GFLOPS, GPU wall-clock time and GFLOPS, and MPI wall-clock time and GFLOPS (n = 2, 5, and 10 nodes), each plotted against problem sizes from 100 to 1200 particles.]

We achieved several orders of magnitude of speed-up by moving to a GPU


We achieved results similar to those reported in GPU Gems


The sequential implementation was not optimal, as it did not use SSE or multiple cores


Its throughput was much lower than the theoretical peak FLOPS of the Xeon CPU


The MPI driver showed that task-level parallelism can be exploited using distributed-memory computing



Run the MPI-GPU version on Draco


FMM (Fast Multipole Method) MPI implementation


Multi-device GPU implementation