Multiresolution Analysis, Computational Chemistry, and Implications for High Productivity Parallel Programming


Aniruddha G. Shet, James Dinan, Robert J. Harrison, and P. Sadayappan

Background



Multiresolution Analysis (MRA)

- A mathematical technique for function approximation
- The representation is a hierarchy of coefficients
- Dynamically adapts to guarantee the accuracy of the approximation
- Varying degrees of information granularity in the hierarchy
- Trades numerical accuracy for computation time

MRA can represent different areas of an object at different levels of detail.
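
As a concrete sketch of that hierarchy (standard multiresolution notation; the specific symbols are not from this poster), a function is expanded over a ladder of nested approximation spaces, and adaptivity keeps difference (wavelet) coefficients only where they are significant:

    V_0 \subset V_1 \subset \cdots \subset V_n, \qquad
    V_n = V_0 \oplus W_0 \oplus W_1 \oplus \cdots \oplus W_{n-1}

    f(x) \approx \sum_{l} s^{0}_{l} \, \phi_{0l}(x)
               + \sum_{m=0}^{n-1} \sum_{l} d^{m}_{l} \, \psi_{ml}(x)

Refinement at level m, interval l continues only while the difference coefficients d^m_l exceed the accuracy threshold, which is what lets MRA trade accuracy for computation time.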

What is (the) MADNESS (about)?

MADNESS: Multiresolution ADaptive Numerical Environment for Scientific Simulation

- A programming environment for the solution of integral and differential equations
- Built on adaptive multiresolution analysis in a multiwavelet basis and low-separation-rank methods for scaling to higher dimensions
- Fast algorithms with guaranteed precision; precision can be traded for speed
- High-level composition of numerical codes: work with functions and operators
- Target applications: quantum chemistry, atomic and molecular physics, materials science, nuclear structure

[Figure: A molecular orbital of the benzene molecule, with the adaptive mesh also displayed.]

Implementation Issues with MADNESS

Multi-dimensional tree distribution

- Multiresolution adaptivity produces unbalanced coefficient trees (a binary tree in 1-d, a quadtree in 2-d, an octree in 3-d, etc.)
- The tree structure evolves in unscheduled ways due to the very flexible adaptive refinement
- A scheme is needed to partition the complete tree, since some algorithms use the entire tree and not just the leaf nodes
- Two main types of applications:
  - a few large trees (billions of nodes): time evolution of wavepackets in molecular physics
  - many (thousands of) smaller trees (millions of nodes): materials and electronic structure
- Node sizes range from 1 KB to 1 MB

Algorithmic characteristics

- Some are recursive, tree-walking algorithms that move data up/down the tree structure; the number of levels navigated varies dynamically and may constitute a data-dependence chain
- Some move data laterally within the same level of the tree, i.e., between neighboring nodes
- Some apply mathematical functions to the collection of coefficient tensors, possibly combining the individual results
- Certain algorithms operate on multiple trees with different refinement characteristics and produce a new tree

The net requirement: express and manage huge amounts of nested hierarchical concurrency on distributed many-core petascale machines.

[Figure: Binary tree numerical form of a 1-d analytical function; note that some intervals are not sub-divided due to the adaptive nature of refinement. Legend: function interval, coefficient tensor, tree node; sub-trees from decomposition of function space.]

[Figures: Adaptively refined tree; compress tree algorithm; reconstruct tree algorithm; multiplication of differently refined trees. Dashed arrows depict the flow of data between operations on tree nodes: the input data to a node operation is indicated by an arrow pointing into the node, and the output from the operation by an outgoing arrow. Marked nodes denote coefficient tensors added to the tree during a tree algorithm, or removed from the tree after being operated upon.]

Parallel Programming Challenges

Shared Memory Model

- Cilk-style fork-join task parallelism with a work-stealing runtime doing dynamic load balancing is a plausible solution, but:
  - it is not targeted at distributed data
  - the model mandates that a parent task await the completion of its child tasks, which constrains the full expression of available parallelism

Message Passing Model

- Hard to express dynamic, irregular computations
- The two-sided communication model introduces unnecessary overhead in reading and writing distributed tree data
- Does not address the need for dynamic load balancing

Partitioned Global Address Space (PGAS) Models

- e.g., Co-Array Fortran, UPC, Titanium
- Static SPMD model of parallelism; lacks flexible threading capabilities
- Maintaining distributed data structures without remote operations requires complex low-level remote memory references
- Does not address the need for dynamic load balancing

Chapel Programming Model

- Multithreaded parallel programming
  - Global view of computation and data structures
  - Abstractions for data and task parallelism
    - data: domain, forall, iterators
    - task: begin, cobegin, coforall, sync variables, atomic
  - Composition of parallelism
  - Virtualization of threads
- Locality-aware programming
  - locale: a machine unit of storage and processing
  - domains may be distributed across locales
  - the on keyword binds computation to locale(s)
- Object-oriented programming
  - OOP can help manage program complexity
  - Classes and objects are provided in Chapel, but their use is typically not required
  - Advanced language features (e.g., distributions) are expressed using classes
- Generic programming and type inference
  - Type parameters
  - Latent types; variables are statically typed

Several of these constructs appear in combination in the sketch below.
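
A minimal sketch (not from the poster, written in the same early-Chapel dialect as the code that follows) showing a domain, data-parallel iteration, task creation, and locality binding together:

    const D: domain(1) = [1..8];      // a 1-d domain: a first-class index set
    var A: [D] real;                  // an array declared over the domain

    forall i in D do                  // data parallelism across the index set
      A(i) = 0.5 * i;

    cobegin {                         // two tasks; cobegin joins both
      writeln("sum = ", + reduce A);
      writeln("max = ", max reduce A);
    }

    on Locales(0) do                  // bind computation to a locale
      begin writeln("async task on locale ", here.id);   // fire-and-forget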

Solution Building Blocks

class FTree {
  const tree: [LocaleSpace] SubTree;

  def FTree(order: int) {
    // create one sub-tree per locale, each allocated on its own locale
    coforall loc in Locales do
      on loc do
        tree[loc.id] = new SubTree(order);
  }

  def this(node: Node) {
    // index the tree like an array: map the node to its host locale
    const t = tree[node2loc(node).id];
    return t[node];
  }

  /* Global tree access methods */
}

class SubTree {
  const coeffDom: domain(1);
  var nodes: domain(Node);               // associative domain keyed by Node
  var coeff: [nodes] [coeffDom] real;    // per-node coefficient tensor

  /* Local tree access methods */
}



Global-view container

- The container stores the tree, presenting a global view of it and one-sided access to tree algorithms
- Internally, it maintains a directory of the distributed collection of sub-trees and transparently maps an indexed node to its host locale
- Sub-trees are structured as associative arrays of node-coefficient key-value pairs
- The mapping scheme could be a simple hash function, or it could be driven by a specialized partitioning strategy with better locality properties, as sketched below
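
As a sketch of the simplest such mapping (the Node fields level and idx are hypothetical; this is not the poster's actual scheme), a hash of the node's tree coordinates can pick the owning locale:

    def node2loc(node: Node): locale {
      // mix the node's level and translation index into a hash,
      // then map it onto the set of locales
      const h = node.level * 2654435761 + node.idx;
      return Locales(h % numLocales);
    }

A locality-aware alternative could instead assign whole sub-trees rooted at some coarse level to a single locale, so that most parent-child traffic stays local.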



Concurrency

- Operations on tree nodes are created as asynchronous tasks
- Tasks are chained together in a hierarchically nested manner to express the recursive parallelism in tree algorithms
- Dependencies are handled by letting tasks synchronize on the completion of spawned tasks
- Locality control permits running tasks under an owner-computes policy, which launches each task where its target data resides
- The plan is to support a work-stealing technique in the language runtime for dynamic load balancing of a distributed set of tasks

const myFTree = new FTree(order=5);

def walkDownOp(node: Node) {
  /* Perform the operation on node */

  // top-down walk: fire-and-forget child tasks; no join is needed
  for child in node.getChild() do
    on myFTree.node2loc(child) do
      begin walkDownOp(child);
}

sync on myFTree.node2loc(root) do
  begin walkDownOp(root);

def walkUpOp(node: Node) {
  // bottom-up walk: coforall joins the child tasks before the
  // parent's operation consumes their results
  coforall child in node.getChild() do
    on myFTree.node2loc(child) do
      walkUpOp(child);

  /* Perform the operation on node */
}

sync on myFTree.node2loc(root) do
  begin walkUpOp(root);


Coefficient tensor arithmetic

- Chapel provides a ZPL-style "array language" that simplifies working with multi-dimensional arrays
- Domains, a first-class language concept denoting an index set, define the size and shape of arrays and support data-parallel iteration when creating and slicing arrays
- A range of array operators facilitates parallel operations on whole arrays or array slices, eliminating the need for tedious array indexing
- Mathematical functions involving coefficient tensors are easily expressed in the array language

/* Overloaded multiplication operator to perform
   vector-matrix multiplication */
def *(V: [] real, M: [] real)
    where V.rank == 1 && M.rank == 2 {
  const R: [i in M.domain.dim(2)] real
        = + reduce (V * M(..,i));
  return R;
}

/* Overloaded multiplication operator to perform
   matrix-vector multiplication */
def *(M: [] real, V: [] real)
    where M.rank == 2 && V.rank == 1 {
  const R: [i in M.domain.dim(1)] real
        = + reduce (M(i,..) * V);
  return R;
}

/* Norm of a vector */
def normf(V) where V.rank == 1 {
  return sqrt(+ reduce V**2);
}
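
A short usage sketch of the overloads above (the values are illustrative only):

    var V: [1..3] real;
    var M: [1..3, 1..3] real;

    [i in 1..3] V(i) = i;                // V = (1.0, 2.0, 3.0)
    [(i,j) in M.domain] M(i,j) = i + j;  // fill the matrix in parallel

    const W = V * M;     // vector-matrix product via the first overload
    writeln(normf(W));   // Euclidean norm of the result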



Python-like high-level programming

- Better usability when writing end-user codes: work with functions and operators rather than their underlying representations
- The key language ideas are the object-oriented and type-inference features
- Future work will explore writing dimension-independent programs

/* Fn_Test1 class wraps an analytic function */
var f = new Fn_Test1();

/* Function class holds the numerical tree form */
var F = new Function(k=5, thresh=1e-5, f=f);

/* Overloaded arithmetic operators on the Function
   class invoke various tree algorithms */
var H = F + F; H = F * F;

/* Print the numerical value at a given point */
writeln("Numerical value: ", format("%0.8f", F(0.5)));


Future Work

- Recursive task parallelism
  - Language constructs to distinguish between tasks that may vs. must run in parallel
    - a "may" construct permits runtime management of parallelism
    - a "must" construct supports patterns like producer-consumer
  - Application vs. compiler control over the granularity and degree of parallelism
- DAG-based dynamic scheduling of tasks inside the Chapel runtime
  - Provide execution guarantees for a class of DAGs
- Compiler support for parallel distribution/iteration/algebraic operations on associative array structures

Putting the Pieces Together

/* Parallel refine algorithm */
def refine(node = root) {
  const child = node.getChild();

  // project both children and apply the two-scale filter
  var sc: [0..2*k-1] real;
  sc[0..k-1]   = project(child(1));
  sc[k..2*k-1] = project(child(2));

  const dc = sc*hgT;
  const nf = normf(dc[k..2*k-1]);

  if (nf < thresh) {
    // accurate enough: store the children's scaling coefficients
    sumC[child(1)] = sc[0..k-1];
    sumC[child(2)] = sc[k..2*k-1];
  }
  else {
    // refine further, spawning each child task on its owner locale
    on sumC.node2loc(child(1)) do
      begin refine(child(1));
    on sumC.node2loc(child(2)) do
      begin refine(child(2));
  }
}

sync on sumC.node2loc(root) do
  begin refine(root);

/* Parallel compress algorithm */
def compress(node = root) {
  const child = node.getChild();

  // both children must be compressed before the parent
  cobegin {
    on sumC.node2loc(child(1)) do
      if !sumC.hasCoeffs(child(1)) then
        compress(child(1));
    on sumC.node2loc(child(2)) do
      if !sumC.hasCoeffs(child(2)) then
        compress(child(2));
  }

  var sc: [0..2*k-1] real;
  sc[0..k-1]   = sumC[child(1)];
  sc[k..2*k-1] = sumC[child(2)];

  const dc = sc*hgT;
  sumC[node]  = dc[0..k-1];     // parent scaling (sum) coefficients
  diffC[node] = dc[k..2*k-1];   // parent difference coefficients

  sumC.remove(child(1));
  sumC.remove(child(2));
}

sync on sumC.node2loc(root) do
  begin compress(root);

sumC and diffC are global-view tree containers.
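
The sc*hgT step in both routines is the standard multiwavelet two-scale transform (standard notation; the poster does not spell it out): concatenating the two children's scaling coefficients and applying the filter matrix yields the parent's scaling (sum) and difference coefficients,

    \begin{pmatrix} s^{n}_{l} \\ d^{n}_{l} \end{pmatrix}
      = \begin{pmatrix} H^{(0)} & H^{(1)} \\ G^{(0)} & G^{(1)} \end{pmatrix}
        \begin{pmatrix} s^{n+1}_{2l} \\ s^{n+1}_{2l+1} \end{pmatrix}

so sc holds the concatenated child coefficients, dc[0..k-1] is the parent's sum part, and dc[k..2*k-1] is the difference part whose norm drives the refinement test.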

Research sponsored in part by the Laboratory Directed Research and Development Program and Post Masters Research Participation Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, and by DOE grant #DE-FC02-06ER25755 and NSF grant #0403342.