Computer and Automation Research Institute

Hungarian Academy of Sciences



Automatic checkpoint of Condor-PVM applications by P-GRADE

Jozsef Kovacs, Peter Kacsuk

Laboratory of Parallel and Distributed Systems

MTA SZTAKI, Budapest, Hungary

{smith, kacsuk}@sztaki.hu

http://www.lpds.sztaki.hu

14-16th April 2004, Paradyn/Condor week, Madison, USA



Background: The Hungarian ClusterGrid

2002, Hungarian Ministry of Education, NIIF: procurement project to equip universities, high schools and public libraries with PC labs.

More than 2000 PCs, considered an enormous computational resource, were spread over the country.

Grid Technical Board: the goal was to build up a minimal but functional grid system.

Dual-boot PC labs are connected throughout the country. Day-time operation: Windows desktop use; night-time operation: grid mode use. A “grid backbone” infrastructure is operational 24 hours a day.

Around 800 PCs are interconnected, with 400 Gflops performance, via a private networking solution (MPLS VPN) over the academic network.

1st generation ClusterGrid: a single large Condor pool.

2nd generation ClusterGrid: a Condor-based, transaction-based grid connected by web services.



Hungarian ClusterGrid Infrastructure

Condor pools are connected by a global Grid Resource Broker, which uses dynamic UID/GID mapping for user jobs and a “one job, one directory structure” job format.

Scalable, easy-to-manage system.

In production since July 2003, with more than 30000 real user jobs executed.

Applications range from fundamental research (mathematics, physics) to applied research (biology, chemistry):
  investigation of C60 molecule in electromagnetic fields
  simulation of protein molecules
  fractal calculation
  investigation of imbalanced phase transitions
  etc.

Two classes of applications are currently supported: parameter scanning, and master-worker jobs parallelized by PVM.

For more information, see http://www.clustergrid.iif.hu.


[Figure: Hungarian ClusterGrid Infrastructure]


Motivation

Checkpointing and migration support is necessary:
  to enable load balancing
  to support fault tolerance
  to support the day-night working mode of the Hungarian ClusterGrid
  etc.

Automatic checkpointing for sequential jobs in the standard universe is provided by Condor.

Fault-tolerant execution of Master-Worker style parallel jobs is supported, but without automatic checkpointing.

With the P-GRADE environment, Condor is able to checkpoint PVM jobs automatically, to enable load balancing and to make long-running worker processes fault-tolerant.


P-GRADE environment

Parallel Grid Run-time and Application Development Environment


Using P-GRADE job mode for the whole range of parallel/distributed systems

[Figure labels: P-GRADE; PVM, MPI, Workflow; Supercomputers, Clusters, Grids; CondorGrid, GT2 Grid, OGSA]


P-GRADE and Condor


Current prototype for migration framework


First prototype is currently based on:
  P-GRADE
  Condor
  PVM

Requirements:
  No manual code preparation is required
  No user interaction during execution
  No PVM modification
  No extra requirements from schedulers
  Just build your application using P-GRADE


Structure of P-GRADE application

[Figure labels: built-in server; clients A, B, C, D; message passing; Terminal; Files]

Server process:
  spawn/terminate
  identification/topology
  access to terminal/files

Clients:
  identification of neighbors by the server
  access to files/terminal through the server
  primitives for communication
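The division of roles above can be sketched in a few lines of Python. This is only an illustrative model, not P-GRADE code: the class names `BuiltInServer` and `Client`, their methods, and the in-memory "files" dictionary are all assumptions made for the example, and real clients would exchange messages through PVM rather than through Python lists.

```python
class BuiltInServer:
    """Hypothetical stand-in for the built-in server process."""

    def __init__(self, topology):
        # topology: client name -> list of neighbor names (assumed format)
        self.topology = topology
        self.clients = {}   # currently running clients
        self.files = {}     # path -> contents; models file access via the server

    def spawn(self, name):
        client = Client(name, self)
        self.clients[name] = client
        return client

    def terminate(self, name):
        self.clients.pop(name, None)

    def neighbors_of(self, name):
        # Identification: a client learns its neighbors from the server.
        return [n for n in self.topology.get(name, []) if n in self.clients]

    def write_file(self, path, text):
        self.files[path] = self.files.get(path, "") + text


class Client:
    """Hypothetical client: talks to neighbors, reaches files via the server."""

    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.inbox = []

    def neighbors(self):
        return self.server.neighbors_of(self.name)

    def send(self, dest, payload):
        # Communication primitive; the list append stands in for message passing.
        if dest not in self.neighbors():
            raise ValueError(f"{dest} is not a neighbor of {self.name}")
        self.server.clients[dest].inbox.append((self.name, payload))

    def log(self, path, text):
        # File access always goes through the server.
        self.server.write_file(path, f"{self.name}: {text}\n")
```

A client spawned before its neighbor exists simply sees a shorter neighbor list, which mirrors the idea that identification and topology are owned by the server.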


Checkpointing a single process

1. Initiate a checkpoint
2. Synchronize in-transit messages and disconnect from message passing (MP)
3. Collect address-space information
4. Send checkpoint
5. Store checkpoint onto server
6. Reconnect to MP

[Figure labels: User process; ckpt lib; handle MP (shown twice); Checkpoint Server; Storage; step numbers 1 to 6]

Vic Zandy’s single process checkpointer: www.cs.wisc.edu/~zandy/ckpt
© University of Wisconsin, Madison (former member of the Paradyn group)
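The six steps can be walked through with a small simulation. This is a sketch under stated assumptions, not the ckpt library's interface: `UserProcess`, `CheckpointServer` and their methods are hypothetical names, and `pickle` of a Python dictionary stands in for saving a real address-space image.

```python
import pickle


class CheckpointServer:
    """Hypothetical checkpoint server: keeps one image per process id."""

    def __init__(self):
        self.storage = {}

    def store(self, pid, image):
        self.storage[pid] = image

    def load(self, pid):
        return self.storage[pid]


class UserProcess:
    """Hypothetical user process with a message-passing (MP) connection."""

    def __init__(self, pid, state):
        self.pid = pid
        self.state = state      # stands in for the address space
        self.connected = True   # attached to the MP layer
        self.in_transit = []    # messages sent to us but not yet received
        self.received = []
        self.log = []

    def checkpoint(self, server):
        self.log.append("1 initiate")
        # 2. Drain in-transit messages, then leave the MP layer.
        self.received.extend(self.in_transit)
        self.in_transit.clear()
        self.connected = False
        self.log.append("2 synchronize and disconnect")
        # 3. Collect the state that must survive (assumed to be picklable).
        image = pickle.dumps({"state": self.state, "received": self.received})
        self.log.append("3 collect")
        # 4. and 5. Send the image; the server stores it.
        self.log.append("4 send")
        server.store(self.pid, image)
        self.log.append("5 store")
        # 6. Rejoin the MP layer and continue.
        self.connected = True
        self.log.append("6 reconnect")

    @classmethod
    def restore(cls, pid, server):
        data = pickle.loads(server.load(pid))
        proc = cls(pid, data["state"])
        proc.received = data["received"]
        return proc
```

Draining in-transit messages before disconnecting is what keeps the stored image consistent: nothing addressed to the process is lost between steps 2 and 6.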


Modified structure to checkpoint processes

[Figure labels: Server/coordination module; Client A, Client B, Client D; message passing library; Files; Terminal; Checkpoint Server; Storage; ckpt lib (shown four times). Detail of Client C: ckpt lib, user code, comm lib, mp lib]


Migration among friendly Condor pools

Legend: S: Server; CS: Checkpoint Server; P: PVM daemon; A, B, C: User processes

Step 1: Starting the application

Step 2: Condor is vacating a node

Step 3: Checkpointing processes

Step 4: Process resumed on friendly Condor pool
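The four steps can be traced with a toy model of process placement. Everything here is an assumption for illustration: the `Application` class, the node names, and the string "images" are hypothetical, and no Condor or PVM call is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    """Hypothetical view of a running application (step 1: it has started)."""
    placement: dict                                   # process name -> node name
    checkpoints: dict = field(default_factory=dict)   # held by the checkpoint server
    events: list = field(default_factory=list)

    def vacate(self, node, friendly_node):
        """Steps 2 to 4 for every user process placed on `node`."""
        affected = sorted(p for p, n in self.placement.items() if n == node)
        # Step 2: Condor is vacating the node.
        self.events.append(f"vacate {node}")
        # Step 3: the affected processes are checkpointed.
        for proc in affected:
            self.checkpoints[proc] = f"image-of-{proc}"
            self.events.append(f"checkpoint {proc}")
        # Step 4: they resume on a node of the friendly pool.
        for proc in affected:
            self.placement[proc] = friendly_node
            self.events.append(f"resume {proc} on {friendly_node}")
        return affected
```

Processes on other nodes are left where they are; only those on the vacated node are checkpointed and moved.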


Live demonstrations



The prototype has been demonstrated at various conferences and workshops:
  EuroPar’03, Klagenfurt, Austria
  Hungarian Grid Day, Budapest, Hungary
  SuperComputing 2003, Phoenix, USA
  Cluster 2003, Hong Kong, China



Possible scenario on checkpointing and migration of P-GRADE programs between clusters

[Figure labels: P-GRADE GUI; London (UoW); Budapest (SZTAKI); Budapest (BUTE)]

1. P-GRADE program submitted to Budapest as a Condor job
2. P-GRADE program runs at SZTAKI cluster
3. P-GRADE program migrates to London as a Condor job
4. P-GRADE program runs at UoW cluster

SZTAKI & BUTE clusters overloaded: checkpointing


Integrated checkpoint and monitor


The checkpoint system cooperates with the GRM-Mercury-PROVE monitoring and visualisation system:
  logs out the user process from the monitoring layer before termination
  logs in the user process into the monitoring layer after resumption
  the user can trace the machines to which the process migrated
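The logout/login ordering and the resulting host trace can be illustrated as follows. This is a minimal sketch and does not use the GRM, Mercury or PROVE interfaces: the `Monitor` class, the `migrate` helper and the host names are hypothetical.

```python
class Monitor:
    """Hypothetical monitoring layer that records where each process has run."""

    def __init__(self):
        self.active = {}   # process -> host it is currently logged in from
        self.trace = {}    # process -> ordered list of hosts it has run on

    def login(self, proc, host):
        self.active[proc] = host
        self.trace.setdefault(proc, []).append(host)

    def logout(self, proc):
        self.active.pop(proc, None)


def migrate(monitor, proc, new_host):
    # Log out before the process is terminated on the old host ...
    monitor.logout(proc)
    # ... checkpoint, termination and resumption would happen here ...
    # ... and log in again once the process has resumed on the new host.
    monitor.login(proc, new_host)
```

For example, after `monitor.login("B", "sztaki-n3")` and `migrate(monitor, "B", "uow-n1")`, `monitor.trace["B"]` lists both hosts in order, which is the information a user would need to follow a migrating process.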







Migration among non-friendly Condor pools (under development)

1. Detection of low resources on cluster
2. Removal of application from the queue
3. Transfer binaries, checkpoint files, work files
4. Submit application to the queue
5. Auto self-recovery of P-GRADE application

[Figure labels: P-GRADE environment; GRID Application Manager; CONDOR pool A; CONDOR pool B]

It requires consultation with CONDOR developers…
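Since this mode is described as under development, the following is only a guess at how the five numbered steps might be sequenced by an application manager. The function name, the dictionary layout of pools and jobs, and the `min_free_nodes` threshold are all hypothetical; no Condor command is invoked.

```python
def migrate_between_pools(pool_a, pool_b, job, min_free_nodes=1):
    """Move `job` from pool_a to pool_b; return the steps that were performed."""
    steps = []
    # 1. Detection of low resources on the cluster (threshold is an assumption).
    if pool_a["free_nodes"] >= min_free_nodes:
        return steps  # enough resources, nothing to do
    steps.append("detect low resources")
    # 2. Removal of the application from the queue of pool A.
    pool_a["queue"].remove(job["name"])
    steps.append("remove from queue")
    # 3. Transfer binaries, checkpoint files and work files to pool B.
    transferred = job["binaries"] + job["checkpoints"] + job["work_files"]
    pool_b["files"].extend(transferred)
    steps.append("transfer files")
    # 4. Submit the application to the queue of pool B.
    pool_b["queue"].append(job["name"])
    steps.append("submit to queue")
    # 5. Self-recovery: the application restarts from its latest checkpoint.
    job["resume_from"] = job["checkpoints"][-1] if job["checkpoints"] else None
    steps.append("self-recovery")
    return steps
```

The order matters: the job leaves pool A's queue before its files are moved, and it is only submitted to pool B once the checkpoint files it needs for recovery are in place.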


Summary of advantages/disadvantages


Advantages:
  no modification of the grid execution environment is required, since all checkpointing/migration capability is built into the application
  supports the day-night working mode in the Hungarian ClusterGrid environment
  adaptivity and automation come from Condor
  Condor-PVM applications, with topology of any kind, can now be dynamically migrated like sequential jobs (Note: Condor does not checkpoint PVM applications; only fault-tolerant execution is supported for Master-Worker type applications)
  migrating jobs can be monitored online and visualised

Limitations:
  currently only P-GRADE-generated PVM jobs are supported


Conclusion


A parallel program checkpointing mechanism that can be applied to generic PVM programs.

A checkpointing mechanism that can be connected to Condor in order to realize migration of PVM jobs among Condor pools.

By integrating the P-GRADE migration framework and the Mercury Grid monitor, PVM applications can be performance-monitored and visualized even during their migration.

Condor-PVM, through our checkpointing algorithm, is enhanced to checkpoint PVM applications as is done for sequential jobs.


Thank you for your attention!


Jozsef Kovacs <smith@sztaki.hu>

Information about P-GRADE:

pgrade@sztaki.hu

http://www.lpds.sztaki.hu/pgrade

Next release is coming at the end of April…

Information about Hungarian ClusterGrid:

http://www.clustergrid.iif.hu