
A High-Level Framework for Parallelizing Legacy Applications for Multiple Platforms

Ritu Arora
Texas Advanced Computing Center
Email: rauta@tacc.utexas.edu


Outline

- Motivation and Goals
- Overview of the Framework with demos
- Results
- Features and Benefits
- Project Status
- Future Work
- Conclusion
- Q & A

Plenty of Parallel Programming Languages and Paradigms

- MPI
- OpenMP
- CUDA
- OpenCL
- Implicitly parallel languages (X10, Fortress, SISAL)
- PGAS languages (UPC, Co-Array Fortran)
- Offload programming for MIC
- SHMEM
- Cilk/Intel Cilk Plus
- Charm++
- HPF
- Hybrid programming (MPI + OpenMP)

There is a need to develop a tool (a high-level framework) that offers a low-risk way for domain experts to try HPC, but first…

…Understanding the Mindset of the User Community is Important

- “…the history of HPC is littered with new technologies that promised increased scientific productivity but are no longer available.”
- “A new technology that can coexist with older ones has a greater chance of success than one requiring complete buy-in at the beginning.”
- “For many frameworks, a significant barrier to their use is that you can’t integrate them incrementally.”
- “Frameworks provide programmers a higher level of abstraction, but at the cost of adopting the framework’s perspective on how to structure the code.”

Source: Understanding the High Performance Computing Community: A Software Engineer’s Perspective, Basili et al.

Standard and Non-Standard Steps for Parallelization that are Repeatable

- Examples of standard steps in developing an MPI application (common to all MPI programs):
  - every MPI program has #include "mpi.h"
  - every MPI program has MPI_Init and MPI_Finalize function calls (see the sketch below)
- Non-standard steps in developing an MPI application:
  - for-loop parallelization, data distribution, mapping of tasks to processes, and orchestration of the exchange of messages
- Steps for splitting the work in a for-loop amongst all the processes in MPI_COMM_WORLD are standard for a given load-balancing scheme
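As a concrete illustration of these standard steps, a minimal MPI skeleton in C is sketched below. It is a generic example, not code produced by the framework; the non-standard, application-specific steps would go between the initialization and finalization calls.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* standard step: initialize the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process within MPI_COMM_WORLD */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

        /* non-standard steps (data distribution, message exchange, ...) would go here */
        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                        /* standard step: shut down MPI */
        return 0;
    }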

Goals

- Develop a high-level framework for semi-automatic parallelization that can leverage the investment made in legacy applications
- The framework should be built on top of successful programming paradigms like MPI, OpenMP, and CUDA
- Provide support for incremental parallelism through the framework
- Abstract the standard and non-standard steps in parallelization


How Does the Framework Work?

Providing Specifications Through Hi-PaL (1)

General structure of Hi-PaL code to generate MPI code:

    Parallel section begins <hook type> (<hook pattern>) mapping is <mapping type> {
        <Hi-PaL API for specifying the operation> <hook>
        && in function (<function name>)
    }

General structure of Hi-PaL code to generate OpenMP code:

    OMP_Parallel {
        <Hi-PaL API for specifying the operation> && schedule is <schedule type> <hook>
        && in function (<function name>)
    }
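As a rough illustration (not the framework's literal output), an OpenMP specification of the second form, with a static schedule and a loop operation, would map to a construct along the following lines; the loop, the arrays, and the schedule choice are assumptions made only for this sketch.

    void scale_and_add(int n, double a, const double *x, double *y)
    {
        /* "schedule is <schedule type>" corresponds to an OpenMP schedule clause */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];   /* the operation named through the Hi-PaL API */
        }
    }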

Providing Specifications Through Hi-PaL (2)

A set of Hi-PaL APIs has been developed for precisely capturing the end-users’ specifications at a high level. Examples:

- ParExchange2DArrayInt(<array name>, <num of rows>, <num of columns>): exchange neighboring values in stencil-based computations
- Parallelize_For_Loop where (<for_init_stmt>; <condition>; <stride>): parallelize the for-loop with the matching initialization statement, condition, and stride
- ReduceSumInt(<variable name>): MPI_Reduce with the MPI_SUM operation, or an OpenMP reduction clause with the ‘+’ operator; the reduced variable is of type integer (sketched after this list)
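For orientation, the sketch below shows the kind of code a ReduceSumInt-style specification corresponds to in the two back-ends (MPI_Reduce with MPI_SUM, and an OpenMP ‘+’ reduction). It is an illustration only, not the framework's generated output, and the helper names are hypothetical.

    #include "mpi.h"

    /* Hypothetical helper: sum an int across all processes onto rank 0 (MPI back-end). */
    int reduce_sum_int_mpi(int local_value)
    {
        int global_value = 0;
        MPI_Reduce(&local_value, &global_value, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        return global_value;   /* meaningful on rank 0 */
    }

    /* Hypothetical helper: sum per-iteration int contributions (OpenMP back-end). */
    int reduce_sum_int_omp(int n, const int *contrib)
    {
        int sum = 0;
        #pragma omp parallel for reduction(+:sum)   /* reduction clause with the '+' operator */
        for (int i = 0; i < n; i++)
            sum += contrib[i];
        return sum;
    }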

Parallelizing Poisson Solver (1)

    //other code
    NTIMES = atoi(argv[3]);
    a = allocMatrix<double>(a, M, N);
    b = allocMatrix<double>(b, M, N);
    f = allocMatrix<double>(f, M, N);
    start = 0;
    //other code
    printMatrix<double>(a, M, N);
    t1 = gettime();
    for (k = start; k < NTIMES && norm >= tolerance; k++) {
        b = compute(a, f, b, M, N);
        ptr = a;
        a = b;
        b = ptr;
        norm = normdiff(b, a, M, N);
    }
    t2 = gettime();
    //other code

Code snippet of the serial Poisson solver

Parallelizing Poisson Solver (2)

    Parallel section begins after ("NTIMES = atoi(argv[3]);") mapping is Linear {
        ParExchange2DArrayDouble(a, M, N) before statement
            ("printMatrix<double>(a, M, N);") && in function ("main");
        ParExchange2DArrayDouble(b, M, N) before statement
            ("printMatrix<double>(a, M, N);") && in function ("main");
        ParExchange2DArrayDouble(b, M, N) after statement
            ("b=compute(a, f, b, M, N);") && in function ("main");
        AllReduceSumInt(norm) after statement
            ("norm = normdiff(b, a, M, N);") && in function ("main")
    }

Hi-PaL code to generate MPI code for the Poisson solver

Generated MPI Code for Poisson Solver (1)

    //other code
    NTIMES = atoi(argv[3]);
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size_Fraspa);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_Fraspa);
    create_2dgrid(MPI_COMM_WORLD, &comm2d_Fraspa, …);
    create_diagcomm(MPI_COMM_WORLD, size_Fraspa, …);
    rowmap_Fraspa.init(M, P_Fraspa, p_Fraspa);
    colmap_Fraspa.init(N, Q_Fraspa, q_Fraspa);
    myrows_Fraspa = rowmap_Fraspa.getMyCount();
    mycols_Fraspa = colmap_Fraspa.getMyCount();
    M_Fraspa = M;
    N_Fraspa = N;
    M = myrows_Fraspa;
    N = mycols_Fraspa;
    a = allocMatrix<double>(a, M, N);

Generated MPI Code for Poisson Solver (2)

    b = allocMatrix<double>(b, M, N);
    f = allocMatrix<double>(f, M, N);
    start = 0;
    //other code
    a = exchange<double>(a, myrows_Fraspa + 2, …);
    b = exchange<double>(b, myrows_Fraspa + 2, …);
    printMatrix<double>(a, M, N);
    t1 = MPI_Wtime();
    for (k = start; k < NTIMES && norm >= tolerance; k++) {
        b = compute(a, f, b, M, N);
        b = exchange<double>(b, myrows_Fraspa + 2, …);
        ptr = a;
        a = b;
        b = ptr;
        norm = normdiff(b, a, M, N);
        MPI_Allreduce(&norm, &norm_Fraspa, 1, MPI_INT, MPI_SUM, …);
        norm = norm_Fraspa;
    }
    //other code

Snippet of Exchange Template for MPI

    template <typename T>
    T** exchange(T** data, int nrows, int ncols, int P, int Q, int p, int q,
                 MPI_Comm comm2d, MPI_Comm rowcomm, MPI_Comm colcomm) {
        //other code above. Create datatype for the recvtype code below
        MPI_Type_vector(nrows - 2, 1, ncols, datatype, &temptype);
        MPI_Type_extent(datatype, &sizeoftype);
        int blens[2] = {1, 1};
        MPI_Aint displ[2] = {0, sizeoftype};
        MPI_Datatype types[2] = {temptype, MPI_UB};
        MPI_Type_struct(2, blens, displ, types, &vectype);
        MPI_Type_commit(&vectype);
        MPI_Cart_shift(rowcomm, 0, -1, &prev, &next);
        MPI_Cart_shift(colcomm, 0, -1, &down, &up);
        //send and receive the boundary rows
        MPI_Irecv(&data[0][1], ncols - 2, datatype, up, 0, …);
        MPI_Irecv(&data[nrows - 1][1], ncols - 2, datatype, down, 0, …);
        MPI_Isend(&data[1][1], ncols - 2, datatype, up, 0, colcomm, …);
        MPI_Isend(&data[nrows - 2][1], ncols - 2, datatype, down, …);
        …
    }

Providing the Specifications Through Command-Line Interface

    Would you like to use MPI or OpenMP?
    (1) MPI
    (2) OpenMP
    2
    ==================================================
    Would you like to use MIC?
    (1) Yes
    (2) No
    1
    ==================================================
    Would you like to use this for loop? Y or N?
    for ( i = 0; i < ihi; i++ ){
        i4_to_bvec ( i, n, bvec );
        value = circuit_value (n, bvec);
        ... //other lines of code
    }
    Y
    This loop contains the following variables:
    i, value, j, solution_num
    Choose the variables for reduction:
    solution_num
    Operation Complete... OpenMP code with offload capability is generated
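A minimal sketch of what the generated OpenMP code with offload capability for this loop could look like, assuming the Intel offload pragma and a ‘+’ reduction on solution_num; the framework's actual output, including data-transfer and privatization clauses, may differ.

    #pragma offload target(mic)                          /* offload the region to the MIC coprocessor */
    #pragma omp parallel for reduction(+:solution_num)   /* sum-reduce solution_num across threads */
    for ( i = 0; i < ihi; i++ ){
        i4_to_bvec ( i, n, bvec );
        value = circuit_value (n, bvec);
        /* ...other lines of code (solution_num is updated here)... */
    }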

Providing the Specifications Through Graphical User Interface

Currently Available Support For…

- Parallel programming paradigms: MPI; OpenMP; MPI + OpenMP; OpenMP + Offload; CUDA
- Parallel programming patterns: for-loops with reduction; stencil-based computations (regular mesh); pipeline; replicable
- Base languages supported: C/C++ (support for Fortran will be added…)

Time for Demos

- Using the framework through the GUI
- Using the framework through Hi-PaL


Results: Poisson Solver (Hi-PaL-based MPI)

Results: Genetic Algorithm for Content-Based Image Retrieval (Hi-PaL-based MPI & OpenMP)

Results: Seismic Tomography Code (GUI-based CUDA)

Results: Circuit Satisfiability Code (GUI-based OpenMP + Offload)


Summary of Features & Benefits of the Framework

- Enhances the productivity of the end-users in terms of the reduction in time and effort:
  - reduction in manual effort by over 90%, while ensuring that the performance of the generated parallel code is within 5% of the sample hand-written parallel code
- Leverages the knowledge of expert parallel programmers
- Separates the sequential and parallel programming concerns while preserving the existing version of the sequential applications


Project Status



Future Work

- Migrate from DMS to the Rose source-to-source compiler and integrate all the interfaces
- Usability studies:
  - rank the user preferences for the interfaces
  - prioritize the development effort
- Integration with PerfExpert and Eclipse
- Ability to handle irregular meshes and to specify the pipeline mode of communication through the GUI
- Address the demands for:
  - a directives-based interface
  - the option of editing the log file to repeat the code-generation process without going through the GUI


Conclusion

Through this research and development effort, we have demonstrated:

- an approach for lowering the adoption barriers to HPC by raising the level of abstraction of parallel programming
- an interactive tool for teaching parallel programming: think “Alice” and “DrJava”
- the usage of multiple interfaces to accommodate the preferences of the user community: one size does not fit all
- that it is possible to achieve abstraction and performance at the same time

Acknowledgement

On behalf of my co-authors, student, and XSEDE intern (Julio Olaya), I would like to thank NSF, XSEDE, and TACC!

Thanks for listening!

Any questions, comments, or concerns?

Contact: rauta@tacc.utexas.edu