A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches


Martin Burtscher¹ and Hassan Rabeti²

¹Department of Computer Science, Texas State University-San Marcos
²Department of Mathematics, Texas State University-San Marcos


Problem: HPC is Hard to Exploit

- HPC application writers are domain experts
  - They typically are not computer scientists and have little or no formal education in parallel programming
  - Parallel programming is difficult and error prone

- Modern HPC systems are complex
  - Consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
  - Require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance

Target Area: Iterative Local Searches

- Important application domain
  - Widely used in engineering & real-time environments

- Examples
  - All sorts of random-restart greedy algorithms
  - Ant colony optimization, Monte Carlo, n-opt hill climbing, etc.

- ILS properties (see the sketch below)
  - Iteratively produce better solutions
  - Can exploit large amounts of parallelism
  - Often have an exponential search space
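What makes these searches so parallel-friendly is that every restart is independent given its seed. The following is a minimal illustrative sketch of the random-restart pattern in C; random_solution, improve_once, evaluate, and SOL_BYTES are hypothetical placeholders for a concrete problem, not part of any real API.

    #include <string.h>

    /* Illustrative placeholders for a concrete problem (not a real API): */
    extern size_t SOL_BYTES;                       /* bytes per solution       */
    extern void   random_solution(long seed, void *sol);  /* seed -> candidate */
    extern int    improve_once(void *sol);         /* one local step, 0 = stuck */
    extern double evaluate(const void *sol);       /* lower cost is better     */

    /* Random-restart local search: each seed yields an independent climb,
       so the iterations of this loop can run on different cores unchanged. */
    double ils(long nseeds, void *scratch, void *best)
    {
      double best_cost = 1.0e300;
      for (long seed = 0; seed < nseeds; seed++) {
        random_solution(seed, scratch);            /* restart from this seed   */
        while (improve_once(scratch)) ;            /* climb to a local optimum */
        double cost = evaluate(scratch);
        if (cost < best_cost) {                    /* keep the champion        */
          best_cost = cost;
          memcpy(best, scratch, SOL_BYTES);
        }
      }
      return best_cost;
    }

ILCS distributes exactly these per-seed iterations across cores, GPUs, and nodes.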

Our Solution: ILCS Framework

- Iterative Local Champion Search (ILCS) framework
  - Supports non-random restart heuristics
    - Genetic algorithms, tabu search, particle swarm optimization, etc.
  - Simplifies implementation of ILS on parallel systems

- Design goal
  - Ease of use and scalability

- Framework benefits
  - Handles threading, communication, locking, resource allocation, heterogeneity, load balance, termination decision, and result recording (checkpointing)

User Interface

- User writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions (a toy example follows below):

    size_t CPU_Init(int argc, char *argv[]);
    void CPU_Exec(long seed, void const *champion, void *result);
    void CPU_Output(void const *champion);

- See paper for GPU interface and sample code

- Framework runs the Exec (map) functions in parallel
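To make the interface concrete, here is a hedged toy client that minimizes a one-dimensional function by hill climbing from each seed. Only the three signatures above come from the slide; the Sol record, the toy objective, and the assumption that the size_t returned by CPU_Init tells the framework how many bytes a solution occupies are illustrative guesses (see the paper for the actual conventions).

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { double cost; long x; } Sol;  /* illustrative solution record */

    static double f(long x)                       /* toy objective to minimize */
    {
      double d = (double)(x % 65536);
      return (d - 12345.0) * (d - 12345.0);
    }

    size_t CPU_Init(int argc, char *argv[])
    {
      (void)argc; (void)argv;    /* a real code would read its input here   */
      return sizeof(Sol);        /* assumed meaning: bytes per solution     */
    }

    void CPU_Exec(long seed, void const *champion, void *result)
    {
      (void)champion;            /* random restart: ignore current champion */
      long x = labs(seed) % 65536;
      double c = f(x);
      for (int improved = 1; improved; ) {        /* 1-opt hill climbing    */
        improved = 0;
        if (f(x + 1) < c)      { x++; c = f(x); improved = 1; }
        else if (f(x - 1) < c) { x--; c = f(x); improved = 1; }
      }
      ((Sol *)result)->cost = c; /* report the local optimum to the framework */
      ((Sol *)result)->x = x;
    }

    void CPU_Output(void const *champion)
    {
      Sol const *s = (Sol const *)champion;
      printf("best: f(%ld) = %g\n", s->x, s->cost);
    }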

Internal Operation: Threading

[Diagram: the ILCS master thread starts and forks one worker thread per CPU core and one handler thread per GPU. The CPU workers run the user's CPU code to evaluate seeds and record local optima; the GPU workers run the user's GPU code to do the same. The handler threads launch the GPU code, sleep, and record the results. The master/communication thread sporadically finds the global optimum via MPI and sleeps in between.]
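The worker side of this diagram can be sketched in a few lines. This rendering uses POSIX threads and a GCC atomic builtin purely for concreteness; better, SOL_BYTES, and the seed counter are assumptions, and the framework's real internals (built on MPI and OpenMP, per the summary slide) may differ.

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    extern void   CPU_Exec(long seed, void const *champ, void *result);
    extern int    better(const void *a, const void *b);  /* assumed comparator */
    extern size_t SOL_BYTES;                             /* bytes per solution */

    static pthread_mutex_t champ_lock = PTHREAD_MUTEX_INITIALIZER;
    static long next_seed;       /* bottom of this node's seed chunk          */
    static void *champion;       /* node-local best; master syncs it via MPI  */

    /* One instance per CPU core: claim the next seed, run the user's Exec
       function against the current champion, and record a better local
       optimum under the lock. GPU handler threads do the analogous thing
       but launch kernels and sleep while the GPU works. */
    static void *worker(void *arg)
    {
      void *result = malloc(SOL_BYTES);
      for (;;) {                                 /* until the master terminates us */
        long seed = __sync_fetch_and_add(&next_seed, 1);   /* claim a seed */
        CPU_Exec(seed, champion, result);
        pthread_mutex_lock(&champ_lock);
        if (better(result, champion))
          memcpy(champion, result, SOL_BYTES);
        pthread_mutex_unlock(&champ_lock);
      }
      return arg;                                /* not reached */
    }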

Internal Operation: Seed Distribution

- E.g., 4 nodes w/ 4 cores (a,b,c,d) and 2 GPUs (1,2); see the figure below

- Benefits
  - Balanced workload irrespective of the number of CPU cores or GPUs (or their relative performance)
  - Users can generate other distributions from the seeds
  - Any injective mapping results in no redundant evaluations

[Diagram: each node gets a contiguous chunk of the 64-bit seed range (Node 0 starts at seed 0; Node 3 ends at seed 2^64-1; with 4 nodes the chunk boundaries fall at multiples of 2^62). Within its chunk, the CPU threads process seeds from the bottom up, one seed per thread at a time, while the GPUs process seeds from the top down, a strided range of seeds per GPU at a time.]
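A hedged sketch of how such a chunking might be computed; node_chunk is a hypothetical helper that merely reproduces what the diagram shows, not the framework's actual code.

    #include <stdint.h>

    /* Split the 64-bit seed space into one contiguous chunk per node, e.g.,
       chunks of 2^62 seeds each for 4 nodes as in the figure (the last
       chunk is trimmed to the end of the range). Within [lo, hi], CPU
       threads consume seeds counting up from lo while GPUs consume strided
       ranges counting down from hi, so the two fronts meet at a point set
       by the relative CPU/GPU speeds: load balance comes for free. */
    void node_chunk(int node, int nnodes, uint64_t *lo, uint64_t *hi)
    {
      uint64_t size = UINT64_MAX / (uint64_t)nnodes + 1;  /* per-node chunk */
      *lo = (uint64_t)node * size;
      *hi = (node == nnodes - 1) ? UINT64_MAX : *lo + size - 1;
    }

Because each seed is claimed by exactly one device, any injective seed-to-candidate mapping keeps the evaluations free of redundancy, which is the property the benefits list above relies on.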

Related Work

- MapReduce/Hadoop/MARS and PADO
  - Their generality and features that ILS does not need incur overhead and steepen the learning curve
  - Some do not support accelerators, some require Java

- ILCS framework is optimized for ILS applications
  - Reduction is provided, multiple keys are not required, no secondary storage is needed to buffer data, non-random restart heuristics are directly supported, early termination is allowed, GPUs and MICs are supported, and everything from single-node workstations to HPC clusters is targeted

Evaluation Methodology

- Three HPC systems (at TACC and NICS):

    system      compute nodes     CPUs    CPU cores   CPU clock    GPUs    GPU cores   GPU clock
    Keeneland           264        528       4,224     2.6 GHz      792     405,504     1.3 GHz
    Ranger            3,936     15,744      62,976     2.3 GHz        -           -           -
    Stampede          6,400     12,800     102,400     2.7 GHz     128*         n/a         n/a

- Largest tested configuration:

    system      compute nodes   total CPUs   total GPUs   total CPU cores   total GPU cores
    Keeneland           128          256          384            2,048           196,608
    Ranger            2,048        8,192            0           32,768                 0
    Stampede          1,024        2,048            0           16,384                 0

Sample ILS Codes

- Traveling Salesman Problem (TSP)
  - Find shortest tour
  - 4 inputs from TSPLIB
  - 2-opt hill climbing (see the sketch below)

- Finite State Machine (FSM)
  - Find best FSM config to predict hit/miss events
  - 4 sizes (n = 3, 4, 5, 6)
  - Monte Carlo method
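For reference, the arithmetic at the core of a 2-opt hill climber: reversing the tour segment between positions i and j replaces two edges with two new ones, so only four distances decide whether a move helps. This is a generic textbook sketch, not the paper's code; dist and the tour layout are assumptions.

    /* Change in tour length if tour[i..j] is reversed (a 2-opt move):
       edges (i-1,i) and (j,j+1) are removed, (i-1,j) and (i,j+1) are
       added. Requires 0 < i <= j < n-1; a negative delta means the move
       shortens the tour, so a hill climber applies it. */
    extern float dist(int city_a, int city_b);  /* assumed distance function */

    float two_opt_delta(const int *tour, int i, int j)
    {
      return dist(tour[i - 1], tour[j]) + dist(tour[i], tour[j + 1])
           - dist(tour[i - 1], tour[i]) - dist(tour[j], tour[j + 1]);
    }

Whether dist is a precomputed table or recomputed from coordinates is exactly the O(n^2)-memory versus O(n)-memory trade-off noted on the TSP throughput slide below.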

FSM Transitions/Second Evaluated

[Chart: transitions evaluated per second (trillions) for 3-, 4-, 5-, and 6-bit FSMs on Keeneland, Ranger, and Stampede; peak rate is 21,532,197,798,304 transitions/s. Annotations: GPU shmem limit; Ranger uses twice as many cores as Stampede.]

TSP Tour-Changes/Second Evaluated

[Chart: moves evaluated per second (trillions) for kroE100, ts225, rat575, and d1291 on Keeneland, Ranger, and Stampede; peak rate is 12,239,050,704,370 moves/s. Annotations: the GPU code is based on the serial CPU code; the CPU pre-computes with O(n^2) memory whereas the GPU re-computes with O(n) memory; each core evaluates a tour change every 3.6 cycles.]

TSP Moves/Second/Node Evaluated

[Chart: moves evaluated per second per node (billions) for kroE100, ts225, rat575, and d1291 on Keeneland, Ranger, and Stampede. GPUs provide >90% of the performance on Keeneland.]

ILCS Scaling on Ranger (FSM)

[Chart: transitions evaluated per second (billions, log scale) vs. compute nodes for the 3- to 6-bit FSMs; >99% parallel efficiency on 2048 nodes. The other two systems scale similarly.]

ILCS Scaling on Ranger (TSP)

[Chart: moves evaluated per second (billions, log scale) vs. compute nodes for kroE100, ts225, rat575, and d1291; >95% parallel efficiency on 2048 nodes. Longer runs scale even better.]

Intra-Node Scaling on Stampede (TSP)

[Chart: moves evaluated per second (billions) vs. 1 to 16 worker threads for kroE100, ts225, rat575, and d1291; >98.9% parallel efficiency on 16 threads. The framework overhead is very small.]

Tour Quality Evolution (Keeneland)

[Chart: deviation from the optimal tour length at each step for kroE100, ts225, rat575, and d1291. Quality depends on chance: ILS provides a good solution quickly, then progressively improves it.]

Tour Quality after 6 Steps (Stampede)

[Chart: deviation from the optimal tour length on 1 to 1024 compute nodes for kroE100, ts225, rat575, and d1291. Larger node counts typically yield better results faster.]

Summary and Conclusions

- ILCS framework
  - Automatic parallelization of iterative local searches
  - Provides MPI, OpenMP, and multi-GPU support
  - Checkpoints the currently best solution every few seconds
  - Scales very well (decentralized)

- Evaluation
  - 2-opt hill climbing (TSP) and Monte Carlo method (FSM)
  - AMD + Intel CPUs, NVIDIA GPUs, and Intel MICs

- ILCS source code is freely available:
  http://cs.txstate.edu/~burtscher/research/ILCS/

Work supported by NSF, NVIDIA, and Intel; resources provided by TACC and NICS