Parallel Programming for Grids: Phylogenetic Trees


Dec 1, 2013


4005-739 Seminar Grid Computing I: Concepts and Practice ∙ http://blackrose02.rit.edu/wiki/doku.php?id=grid:seminar1


Andrew D. Brown¹ and Gregor von Laszewski¹

¹Rochester Institute of Technology, Rochester, NY 14623, USA


A system for computing phylogenetic trees in a grid computing environment is developed. The project uses phylogenetic trees as a basis for experimenting with parallel programming in this environment, and illuminates some differences between that and programming for traditional cluster or SMP computers. Phylogenetic trees have been chosen because they exhibit multiple forms of parallelism, which can be used in several different scenarios. Also of particular interest is the project’s use of the Java CoG Kit. Methods for effectively submitting multiple batch jobs and dynamically modifying Karajan workflows are presented.


Index Terms: Parallel programming, biology.



I. INTRODUCTION

Parallel programming for a grid environment poses many challenges in addition to those normally faced in traditional parallel programming. Most of all, since the computers in a grid are often only available when idle, a programmer must deal with computers that can enter and leave the grid as they please. For this reason, programmers have generally only approached embarrassingly parallel programs, which exhibit no dependencies between sub-problems. If a computer leaves the grid, its job is simply sent somewhere else. Of course, many other problems can benefit from grid computing as well. One such problem is that of computing phylogenetic trees, which is used here as a vehicle for exploring the aforementioned challenges.

II. PHYLOGENETIC TREES

Phylogenetic trees are graphs depicting how organisms evolved from one another over time. Consequently, the trees also show how similar organisms are grouped together.

The first step in making a phylogenetic tree is to create an alignment of the DNA sequences of the organisms involved. Most methods, such as Smith-Waterman, use a matrix to calculate the alignment between two sequences. Multiple alignment is more difficult and is outside the scope of this project, which uses an already-computed multiple alignment.

Next, all the possible trees are generated from the multiple alignment. For each tree, the number of state changes that would be required to produce the alignment is calculated. Starting at the end sites, the method moves upwards through the tree, calculating each next level as the intersection of the states below it if the sites match and as their union otherwise. The tree with the lowest number of state changes at the end of the computation is the one most likely to be correct.
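The bottom-up pass described above can be sketched for a single alignment column on one fixed binary tree. This is a minimal illustration rather than the project's code: the names `FitchParsimony`, `Node`, `encode`, and `score` are hypothetical, and representing each state set as a bitmask over A, C, G, T is an assumption made for brevity.

```java
import java.util.Map;

/** Minimal sketch of bottom-up parsimony scoring for one alignment column. */
final class FitchParsimony {

    /** A binary tree node: leaves carry a species name, inner nodes two children. */
    static final class Node {
        final String species;   // non-null only for leaves
        final Node left, right; // non-null only for inner nodes

        Node(String species) { this.species = species; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.species = null; this.left = left; this.right = right; }

        boolean isLeaf() { return species != null; }
    }

    /** Encodes a DNA base as a one-bit state set (assumed encoding). */
    static int encode(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 4;
            case 'T': return 8;
            default: throw new IllegalArgumentException("Unknown base: " + base);
        }
    }

    private int changes; // state changes counted so far in this pass

    /** Returns the state set of a node, counting one change whenever a union is needed. */
    private int stateSet(Node node, Map<String, Character> column) {
        if (node.isLeaf()) {
            return encode(column.get(node.species));
        }
        int l = stateSet(node.left, column);
        int r = stateSet(node.right, column);
        int common = l & r;
        if (common != 0) {
            return common;  // sites match: keep the intersection
        }
        changes++;          // sites differ: take the union and count a change
        return l | r;
    }

    /** Number of state changes this tree needs to explain one alignment column. */
    static int score(Node root, Map<String, Character> column) {
        FitchParsimony pass = new FitchParsimony();
        pass.stateSet(root, column);
        return pass.changes;
    }

    public static void main(String[] args) {
        Node tree = new Node(new Node(new Node("a"), new Node("b")),
                             new Node(new Node("c"), new Node("d")));
        Map<String, Character> column = Map.of("a", 'A', "b", 'A', "c", 'G', "d", 'T');
        System.out.println(score(tree, column)); // prints 2
    }
}
```

The score of a whole tree would be the sum of this value over every column of the multiple alignment.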

III. PARALLEL COMPUTATION

As the number of species involved grows, the number of trees generated becomes very large. Therefore, some method of speeding up the computation is greatly beneficial. Luckily, phylogenetic trees offer several points where parallelism can be introduced.
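To give a sense of the growth, the following sketch counts the candidate trees under one common assumption: rooted, strictly bifurcating trees with labeled leaves, of which there are 1 · 3 · 5 · … · (2n − 3) for n species. The paper does not state which family of trees the project enumerates, so the formula and the `TreeCount` class are illustrative only.

```java
import java.math.BigInteger;

/** Counts rooted, bifurcating, leaf-labeled trees: 1 * 3 * 5 * ... * (2n - 3). */
final class TreeCount {

    static BigInteger rootedBinaryTrees(int species) {
        if (species < 1) {
            throw new IllegalArgumentException("need at least one species");
        }
        BigInteger count = BigInteger.ONE;
        // Multiply the odd numbers from 3 up to 2n - 3 (empty product for n <= 2).
        for (int k = 3; k <= 2 * species - 3; k += 2) {
            count = count.multiply(BigInteger.valueOf(k));
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(rootedBinaryTrees(4));  // prints 15
        System.out.println(rootedBinaryTrees(10)); // prints 34459425
    }
}
```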

At the highest level, each of the possible trees can be computed by a separate processor. If each processor knows how many processors there are in total and which position it holds among them, the possible trees can be effectively split up between the processors. Since little communication is needed between the nodes, but memory consumption can be rather high, this method is ideal for a distributed computing environment. Normally this would be a cluster.
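One simple way to split the trees using only a processor's position and the processor count is a round-robin assignment of tree indices. The sketch below assumes the trees can be numbered 0 to N − 1 and that a 0-based rank is available; `TreePartition` and `treesFor` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Round-robin split of tree indices among processors; needs no communication. */
final class TreePartition {

    /** Indices of the trees that processor `rank` (0-based) out of `size` should score. */
    static List<Long> treesFor(int rank, int size, long totalTrees) {
        if (size <= 0 || rank < 0 || rank >= size) {
            throw new IllegalArgumentException("rank must be in [0, size)");
        }
        List<Long> mine = new ArrayList<>();
        for (long i = rank; i < totalTrees; i += size) {
            mine.add(i);
        }
        return mine;
    }

    public static void main(String[] args) {
        System.out.println(treesFor(1, 4, 10)); // prints [1, 5, 9]
    }
}
```

For realistic tree counts a processor would iterate over its indices rather than store them in a list; the list is used here only to make the assignment visible.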

Conversely, trees can also be split into smaller sub-trees, which are then run as separate threads and later combined. Communication between the threads is higher, but the memory required is lower. Accordingly, a shared-memory multiprocessor computer, or SMP, is well-suited for this method.

Lastly, both methods can take advantage of keeping track of the lowest score computed for a tree so far. If a node calculating a tree from the same set ever exceeds this score, it can stop early, since the score of the tree it is working on will never get any smaller. Obviously, this adds extra communication that must be considered.
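On a shared-memory machine, the running best score can be kept in a single atomic variable. The following sketch assumes a tree's score is accumulated column by column from non-negative costs, so a partial sum that already exceeds the best known score can be abandoned; `BoundedScoring` and its methods are hypothetical names, and a cluster or grid would need messages or shared storage in place of the in-memory variable.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Shares the lowest score seen so far so that workers can stop scoring early. */
final class BoundedScoring {

    private final AtomicInteger best = new AtomicInteger(Integer.MAX_VALUE);

    /** Sums non-negative per-column costs; returns -1 if the tree was abandoned. */
    int scoreWithBound(int[] columnCosts) {
        int partial = 0;
        for (int cost : columnCosts) {
            partial += cost;
            if (partial > best.get()) {
                return -1; // the sum can only grow, so this tree cannot win
            }
        }
        best.accumulateAndGet(partial, Math::min); // publish if lower
        return partial;
    }

    int best() { return best.get(); }

    public static void main(String[] args) throws InterruptedException {
        BoundedScoring shared = new BoundedScoring();
        int[][] trees = { {1, 2, 3}, {4, 4, 4}, {1, 1, 1} };
        Thread[] workers = new Thread[trees.length];
        for (int i = 0; i < trees.length; i++) {
            int[] costs = trees[i];
            workers[i] = new Thread(() -> shared.scoreWithBound(costs));
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println("best score: " + shared.best()); // prints best score: 3
    }
}
```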

IV. ADAPTATION FOR GRID COMPUTATION

With regard to parallel programming, a grid is analogous to a cluster in that the nodes in a grid are computers with their own memory connected by a network. Therefore, a grid uses the same basic model for parallelism. However, because grids are much more loosely defined, they face a number of unique issues.

First, nodes in a grid are often only available when idle, so care must be taken to account for nodes that enter and leave the grid at any time. This issue is normally handled by a separation between tasks and jobs. Tasks are actions the client wishes to be performed, while jobs are tasks mapped to actual nodes. If a node working on a job leaves the grid, the controller simply maps the task to another node.
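The task/job separation can be illustrated with a toy controller that keeps a queue of pending tasks and a table of running jobs. This is a sketch of the idea only; `TaskController` and its methods are hypothetical and do not correspond to any grid middleware API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy controller: a task is what the client wants done, a job is a task bound to a node. */
final class TaskController {

    private final Deque<String> pendingTasks = new ArrayDeque<>();
    private final Map<String, String> jobs = new HashMap<>(); // node -> task it is running

    /** The client hands over a task; it is not yet tied to any node. */
    void submit(String task) { pendingTasks.addLast(task); }

    /** An idle node joins: bind the next pending task to it. Returns the task, or null. */
    String nodeAvailable(String node) {
        String task = pendingTasks.pollFirst();
        if (task != null) {
            jobs.put(node, task);
        }
        return task;
    }

    /** A node leaves before finishing: drop the job and put its task back in front. */
    void nodeLeft(String node) {
        String task = jobs.remove(node);
        if (task != null) {
            pendingTasks.addFirst(task);
        }
    }

    /** A node reports completion: the job is done and its task is not re-queued. */
    void jobFinished(String node) { jobs.remove(node); }

    int pending() { return pendingTasks.size(); }

    public static void main(String[] args) {
        TaskController controller = new TaskController();
        controller.submit("tree-0");
        controller.submit("tree-1");
        controller.nodeAvailable("nodeA");   // nodeA starts tree-0
        controller.nodeLeft("nodeA");        // nodeA disappears mid-job
        System.out.println(controller.nodeAvailable("nodeB")); // prints tree-0
    }
}
```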

In the case of our problem, the tasks submitted to the controller are the possible trees for the alignment. Of course, the client should not need to generate the trees itself; it should only have to submit the alignment. Therefore, a meta-controller is added between the client and controller. The meta-controller takes the alignment from the client, calculates how many trees will be generated, and submits the appropriate tasks to the grid’s controller. The trees are not actually generated by the meta-controller, however, since this would significantly increase the sequential fraction of the run and greatly degrade performance.

Next, the issue of keeping track of the lowest-scoring tree can be handled with a common output file. When an agent calculates a tree whose score is lower than the one already in the file, it updates the score and writes its tree to the file. Then, when starting a new tree, each agent checks this file again for an updated score, ending early if its tree ever exceeds it. Depending on how long the nodes are running, they may check the file again during processing.
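One way such a shared file could be updated without agents overwriting each other is to take an exclusive file lock around the read-compare-write step. The sketch below assumes a two-line file format (score, then tree) and assumes the shared storage honors file locks, which is not guaranteed on every network file system; `BestTreeFile` and `offer` are hypothetical names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Keeps the lowest score and its tree in a shared file (assumed format: score, newline, tree). */
final class BestTreeFile {

    /** Records (score, tree) if it beats the file's score; returns the score now in the file. */
    static int offer(Path file, int score, String tree) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                     StandardOpenOption.READ, StandardOpenOption.WRITE);
             FileLock lock = ch.lock()) { // exclusive lock, blocks until granted
            ByteBuffer in = ByteBuffer.allocate((int) ch.size());
            while (in.hasRemaining() && ch.read(in) > 0) {
                // keep reading until the buffer holds the whole file
            }
            String text = new String(in.array(), 0, in.position(), StandardCharsets.UTF_8);
            int newline = text.indexOf('\n');
            int recorded = newline < 0 ? Integer.MAX_VALUE
                                       : Integer.parseInt(text.substring(0, newline).trim());
            if (score >= recorded) {
                return recorded; // nothing to update; callers can use this as their bound
            }
            ch.truncate(0);      // also resets the channel position to 0
            ByteBuffer out = ByteBuffer.wrap(
                    (score + "\n" + tree + "\n").getBytes(StandardCharsets.UTF_8));
            while (out.hasRemaining()) {
                ch.write(out);
            }
            return score;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("best-tree", ".txt");
        offer(file, 7, "((a,b),(c,d))");
        offer(file, 9, "((a,c),(b,d))"); // higher score, file is left unchanged
        System.out.println(offer(file, 5, "((a,d),(b,c))")); // prints 5
        System.out.print(Files.readString(file));
    }
}
```

Java file locks are held on behalf of the whole virtual machine, so this guards separate agent processes against each other; threads inside one process would need their own synchronization.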

Finally, the client retrieves the output by downloading the output file after all the jobs have completed.

V. IMPLEMENTATION

The application was developed with the Java CoG Kit for its relative ease of use and cross-platform compatibility.

Job submission was handled using the CoG Kit’s Abstraction Layers, and was modeled after the JobSubmission class used by the cogrun and cog-job-submit command-line tools.

Unfortunately, JobSubmission does not work well with multiple batch jobs, as its TaskHandler and StatusListener objects will continue to wait for a response after a job has been submitted. Even worse, the submission will completely stall after the first job has been submitted.

Basically, TaskHandler and StatusListener wait forever, never returning to their caller. JobSubmission deals with the problem by simply calling System.exit() after a status event is received. Of course, while this works with one job, it makes it very hard to do anything afterwards, namely submitting more jobs.

Therefore, the solution was to break up the process into multiple components and run them in separate threads. Then, while the rest of the threads wait, the main thread can go on to create more threads. Furthermore, if a thread reaches a point where it is no longer needed, it can stop itself without disturbing the rest of the program.

In more detail, the components consist of the Job, JobListener, Submitter, and Submission classes. When the client program needs to submit a job, it first specifies a Submitter, the entity that will actually perform the job submission (essentially a wrapper for TaskHandler). This technique is also beneficial in that only one Submitter is created, so information about submitted jobs can all be retrieved from the same source.

Of course, the client program also needs to initialize the actual job it plans to submit. It does so by specifying the host, executable, arguments, output file, and whether it’s a batch job or not. The job creates a task from these values and holds everything for later steps.

At this point, the client program creates a new Submission, passing the Job and Submitter objects it just made to the constructor. The key, however, is that Submission is actually a thread. Therefore, when it submits the job and waits for its status, it is the Submission that waits instead of the main thread, which is then free to make more submissions. Furthermore, when the submission does receive a status update via JobListener, it calls currentThread().stop() to end itself. Meanwhile, the main thread calls join() on each of the threads. This technique allows main to exit the program after all the jobs have been submitted, which is necessary because TaskHandler creates a number of inaccessible threads that would otherwise still be running.
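The thread-per-submission pattern can be sketched in plain Java without any CoG Kit classes. Everything below is hypothetical: `Job`, `Submitter`, and `Submission` only mirror the roles named in the text, the `Submitter` interface is a stand-in and not the CoG Kit's TaskHandler API, and a CountDownLatch is used to let each thread finish normally instead of calling Thread.stop(), which is deprecated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Sketch: each submission waits for its own status on its own thread. */
final class SubmissionDemo {

    /** Hypothetical job description, standing in for the Job class of the text. */
    static final class Job {
        final String host, executable;
        Job(String host, String executable) { this.host = host; this.executable = executable; }
    }

    /** Hypothetical stand-in for Submitter; not a CoG Kit interface. */
    interface Submitter {
        /** Starts the job and runs onStatus once a status event arrives. */
        void submit(Job job, Runnable onStatus);
    }

    /** One thread per submission: it blocks on the status event so that main does not. */
    static final class Submission extends Thread {
        private final Job job;
        private final Submitter submitter;
        private final CountDownLatch statusSeen = new CountDownLatch(1);

        Submission(Job job, Submitter submitter) { this.job = job; this.submitter = submitter; }

        @Override public void run() {
            submitter.submit(job, statusSeen::countDown);
            try {
                statusSeen.await(); // the waiting happens here, not in the main thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Submits every job on its own thread, then joins them all. Returns the job count. */
    static int submitAll(List<Job> jobs, Submitter submitter) throws InterruptedException {
        List<Submission> threads = new ArrayList<>();
        for (Job job : jobs) {
            Submission s = new Submission(job, submitter);
            s.start(); // returns immediately, so the loop keeps submitting
            threads.add(s);
        }
        for (Submission s : threads) {
            s.join();
        }
        return threads.size();
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake submitter that reports a status from another thread after a short delay.
        Submitter fake = (job, onStatus) -> new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            onStatus.run();
        }).start();
        List<Job> jobs = List.of(new Job("node1", "/bin/date"), new Job("node2", "/bin/date"));
        System.out.println(submitAll(jobs, fake) + " submissions completed");
    }
}
```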

On the other hand, the process for transferring files is radically different. The program uses the CoG Kit’s Karajan Workflow model, in which an XML file specifies what actions should be executed.

The main advantage of using Karajan for file transfers is its simplicity. Only the most basic information needs to be specified to do a transfer, and it can be easily specified in a handful of XML tags. Plus, Karajan has the added benefit of a built-in chmod element, which simplifies setting file permissions and is especially useful for uploads.

Unfortunately, Karajan’s affinity for files can also be a disadvantage. Separate files can be made for separate tasks, of course, but at some point the program will undoubtedly need to change part of a file, generally based on user input.

A number of methods for creating and modifying XML trees dynamically were experimented with, but ultimately we found the most practical way to be a new class called VariableReplacer. VariableReplacer allows the program to easily substitute different values in the XML file by adding the variable names and values to a HashMap. It then uses Scanner to parse the XML file and make the replacements, storing the output as a string so as not to ruin the original file. Lastly, this string is passed to a KarajanWorkflow, which executes the modified actions as desired.
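A replacement class along these lines might look like the following sketch. It is not the authors' VariableReplacer: the `${name}` placeholder syntax, the class name `VariableReplacerSketch`, and the `copy` element in the example are all assumptions, and the XML shown has not been checked against Karajan's actual element set.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

/** Substitutes ${name} placeholders in a template and returns the result as a string. */
final class VariableReplacerSketch {

    private final Map<String, String> variables = new HashMap<>();

    void set(String name, String value) { variables.put(name, value); }

    /** Reads the template line by line; the source itself is never modified. */
    String apply(Readable template) {
        StringBuilder out = new StringBuilder();
        try (Scanner in = new Scanner(template)) {
            while (in.hasNextLine()) {
                String line = in.nextLine();
                for (Map.Entry<String, String> v : variables.entrySet()) {
                    line = line.replace("${" + v.getKey() + "}", v.getValue());
                }
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        VariableReplacerSketch replacer = new VariableReplacerSketch();
        replacer.set("source", "alignment.txt");
        replacer.set("target", "/tmp/alignment.txt");
        // Illustrative XML only; the element and attribute names are made up.
        String template = "<copy from=\"${source}\" to=\"${target}\"/>";
        System.out.print(replacer.apply(new StringReader(template)));
    }
}
```

Values are inserted verbatim, so anything coming from user input would need XML escaping before being passed to set().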

VI. FUTURE DEVELOPMENT

Unfortunately, communication between the processes was not accomplished. In the future, the authors would like to implement the common output file strategy, although making sure the agents do not interfere with each other may be a significant problem. Also, MPICH-G2, an MPI implementation that works with grids, may be a viable and productive alternative.

VII. CONCLUSIONS

In general, the methods presented in this paper have proven useful in parallel programming for grids. Mainly, a method for effectively submitting multiple batch jobs was demonstrated. Also, the Karajan variable replacement technique will be valuable for any developer wishing to modify Karajan workflows from within Java.

ACKNOWLEDGMENT

Andrew Brown thanks Alan Kaminsky of the Rochester Institute of Technology for his work on phylogenetic trees in a parallel environment.

REFERENCES

[1] W. Fitch, "Toward Defining the Course of Evolution: Minimum Change for a Specific Tree Topology," Systematic Zoology, vol. 20, pp. 406-416, 1971.

[2] I. Foster and N. Karonis, "A grid-enabled MPI: message passing in heterogeneous distributed computing systems," in Supercomputing '98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), 1998, pp. 1-11.

[3] A. Kaminsky, "Phylogenetic Trees Building Blocks."

[4] N. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface," Argonne National Laboratory and The University of Chicago, 2002.

[5] H. Stockinger, M. Pagni, L. Cerutti, and L. Falquet, "Grid Approach to Embarrassingly Parallel CPU-Intensive Bioinformatics Problems," in Second IEEE International Conference on e-Science and Grid Computing, 2006, pp. 58-58.