Computer and Automation Research Institute
Hungarian Academy of Sciences
Automatic checkpoint of CONDOR-PVM applications by P-GRADE
József Kovács, Peter Kacsuk
Laboratory of Parallel and Distributed Systems
MTA SZTAKI, Budapest, Hungary
{smith, kacsuk}@sztaki.hu
http://www.lpds.sztaki.hu
14-16th April 2004, Paradyn/Condor week, Madison, USA
Background: The Hungarian ClusterGrid
• 2002, Hungarian Ministry of Education, NIIF: procurement project to equip universities, high schools and public libraries with PC labs.
• More than 2000 PCs, considered an enormous computational resource, were spread over the country.
• Grid Technical Board: the goal was to build up a minimal, but functional grid system.
• Dual-boot PC labs are connected throughout the country. Day-time operation: Windows desktop use. Night-time operation: grid mode use. A "grid backbone" infrastructure is operational 24 hours a day.
• Around 800 PCs are interconnected at 400 Gflops performance via a private networking solution (MPLS VPN) over the academic network.
• 1st generation ClusterGrid: a single large Condor pool.
• 2nd generation ClusterGrid: a Condor based grid, connected by web services and transaction based.
Hungarian ClusterGrid Infrastructure
• Condor pools are connected by a global Grid Resource Broker, which uses dynamic UID/GID mapping for user jobs and a “one job – one directory structure” job format.
• Scalable, easy to manage system.
• In production since July 2003, with more than 30000 real user jobs executed.
• Applications range from fundamental research (mathematics, physics) to applied research (biology, chemistry).
– investigation of C60 molecule in electromagnetic fields
– simulation of protein molecules
– fractal calculation
– investigation of imbalanced phase transitions
– etc.
• Two classes of applications are currently supported: parameter scanning, and master-worker jobs parallelized by PVM.
• For more info, see http://www.clustergrid.iif.hu.
Motivation
Checkpointing and migration support is necessary
• To enable load balancing
• To support fault-tolerance
• To support the day-night working mode of the Hungarian ClusterGrid
• etc.
Automatic checkpointing for sequential jobs in the standard universe is provided by Condor.
Fault-tolerant execution of Master-Worker style parallel jobs is supported, but without automatic checkpointing.
With the P-GRADE environment, Condor is able to perform automatic checkpointing for PVM jobs, to enable load balancing and to make long-running worker processes fault-tolerant.
P-GRADE environment
Parallel Grid Run-time and Application Development Environment
Using P-GRADE job mode for the whole range of parallel/distributed systems
[Figure labels: P-GRADE; PVM, MPI, Workflow; Supercomputers, Clusters, Grids; CondorGrid, GT2 Grid, OGSA]
P-GRADE and Condor
Current prototype for migration framework
• First prototype is currently based on
– P-GRADE
– Condor
– PVM
• Requirements
– No manual code preparation is required
– No user interaction during execution
– No PVM modification
– No extra requirements from schedulers
– Just build your application using P-GRADE
Structure of P-GRADE application
[Figure: a built-in server connected to clients A, B, C and D, and to Terminal and Files; clients communicate by message passing]
Server
• process spawn/terminate
• identification/topology
• access to terminal/files
Clients
• identification of neighbors by the server
• access to files/terminal through the server
• primitives for communication
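The server/client roles listed above can be illustrated with a minimal in-memory sketch. It is not P-GRADE code: `ServerSketch`, its methods and the example topology are hypothetical names chosen for illustration, and real process spawning and PVM communication are replaced by plain Python data structures.

```python
class ServerSketch:
    """Hypothetical stand-in for the built-in server of a generated application."""

    def __init__(self, topology):
        # topology maps each client name to the names of its neighbours
        self.topology = topology
        self.files = {}

    def spawn_clients(self):
        """Server role: process spawn (here it only returns the client names)."""
        return sorted(self.topology)

    def neighbors_of(self, client):
        """Client role: neighbours are identified by asking the server."""
        return list(self.topology[client])

    def append_to_file(self, client, path, text):
        """Client role: file access goes through the server."""
        self.files[path] = self.files.get(path, "") + f"[{client}] {text}\n"


server = ServerSketch({"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]})
clients = server.spawn_clients()
server.append_to_file("B", "out.log", "hello")
print(clients, server.neighbors_of("B"))
```

Keeping topology and file access in one place means a client needs no knowledge of where its peers run, which is what makes the later checkpoint and migration steps transparent to user code.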
Checkpointing a single process
1. Initiate a checkpoint
2. Synchronize in-transit messages and disconnect from the message-passing (MP) layer
3. Collect address-space information
4. Send checkpoint
5. Store checkpoint onto server
6. Reconnect to MP
[Figure: a user process (with ckpt lib and "handle MP" components) and a Checkpoint Server with Storage; arrows numbered 1 to 6 correspond to the steps above]
Vic Zandy’s single process checkpointer: www.cs.wisc.edu/~zandy/ckpt
© University of Wisconsin, Madison (former member of the Paradyn group)
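The six steps can be sketched as a small in-memory simulation. This is an illustration only: it does not use the ckpt library or PVM, `pickle` stands in for collecting the address space, and all class and function names (`MessageLayer`, `CheckpointServer`, `checkpoint`) are hypothetical.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class MessageLayer:
    """Stand-in for the message-passing (MP) layer; not the PVM API."""
    in_transit: list = field(default_factory=list)
    connected: bool = True

    def drain(self):
        """Deliver every in-transit message so none is lost by the checkpoint."""
        delivered, self.in_transit = self.in_transit, []
        return delivered


@dataclass
class CheckpointServer:
    """Stand-in for the checkpoint server and its storage."""
    storage: dict = field(default_factory=dict)

    def store(self, name, image):
        self.storage[name] = image


def checkpoint(name, state, inbox, mp, server):
    """Run steps 1-6 for one process and return the ordered step log."""
    log = ["initiate"]                                       # step 1
    inbox.extend(mp.drain())                                 # step 2: synchronize...
    mp.connected = False                                     # ...and disconnect
    log.append("sync+disconnect")
    image = pickle.dumps({"state": state, "inbox": inbox})   # step 3
    log.append("collect")
    log.append("send")            # step 4: a network transfer in a real system
    server.store(name, image)                                # step 5
    log.append("store")
    mp.connected = True                                      # step 6
    log.append("reconnect")
    return log


mp = MessageLayer(in_transit=["msg-from-B"])
server = CheckpointServer()
steps = checkpoint("A", {"iteration": 7}, [], mp, server)
restored = pickle.loads(server.storage["A"])
print(steps)
print(restored)
```

Draining in-transit messages before the image is taken is the point of step 2: a message delivered into the saved inbox survives the checkpoint, while one left in the MP layer would be lost when the process is restarted elsewhere.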
Modified structure to checkpoint processes
[Figure: the server/coordination module is connected to Terminal, Files and clients A, B, C and D through the message passing library; a ckpt lib is attached to the processes, which communicate with the Checkpoint Server and its Storage. Inset for Client C: ckpt lib, user code, comm lib, mp lib]
Migration among friendly Condor pools
Step 1: Starting the application
S: Server
CS: Checkpoint Server
P: PVM daemon
A,B,C: User processes
Step 2: Condor is vacating a node
Step 3: Checkpointing processes
Step 4: Process resumed on friendly Condor pool
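The four steps can be illustrated with a minimal placement sketch. It assumes that only the processes on the vacated node are checkpointed and moved; the function name, node names and the string used as a checkpoint image are hypothetical, and no Condor or PVM call is involved.

```python
def migrate_on_vacate(placement, vacated_node, target_node, checkpoint_store):
    """Return the new process placement after Condor vacates one node.

    placement maps process name -> node name (Step 1: application started).
    Processes on vacated_node (Step 2) are checkpointed (Step 3) and
    resumed on target_node in a friendly pool (Step 4).
    """
    new_placement = dict(placement)
    for proc, node in placement.items():
        if node == vacated_node:
            checkpoint_store[proc] = f"checkpoint-of-{proc}"   # Step 3
            new_placement[proc] = target_node                  # Step 4
    return new_placement


placement = {"A": "poolA-node1", "B": "poolA-node2", "C": "poolA-node2"}
checkpoint_store = {}
after = migrate_on_vacate(placement, "poolA-node2", "poolB-node1", checkpoint_store)
print(after)
```

Process A stays where it is; only B and C are checkpointed and moved, while the original placement is left unmodified so it can be compared with the result.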
Live demonstrations
• The prototype has been demonstrated at various conferences/workshops
• EuroPar’03, Klagenfurt, Austria
• Hungarian Grid Day, Budapest, Hungary
• SuperComputing 2003, Phoenix, USA
• Cluster 2003, Hong Kong, China
Possible scenario for checkpointing and migration of P-GRADE programs between clusters
[Figure: P-GRADE GUI and three sites: London-UoW, Budapest-SZTAKI, Budapest-BUTE. Labels: "SZTAKI & BUTE clusters overloaded", "checkpointing"]
1. P-GRADE program submitted to Budapest as a Condor job
2. P-GRADE program runs at the SZTAKI cluster
3. P-GRADE program migrates to London as a Condor job
4. P-GRADE program runs at the UoW cluster
Integrated checkpoint and monitor
• The checkpoint system cooperates with the GRM-Mercury-PROVE monitoring and visualisation system
– logs the user process out of the monitoring layer before termination
– logs the user process into the monitoring layer after resumption
– the user can trace the machines to which the process migrated
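The logout/login handshake around a migration can be sketched as follows. This is not the GRM, Mercury or PROVE interface: `MonitorSketch`, `migrate` and the host names are hypothetical and only illustrate the ordering described above.

```python
class MonitorSketch:
    """Hypothetical stand-in for the monitoring layer."""

    def __init__(self):
        self.active = {}   # process name -> host it is currently monitored on
        self.trace = []    # ordered (process, event, host) records

    def login(self, proc, host):
        self.active[proc] = host
        self.trace.append((proc, "login", host))

    def logout(self, proc):
        host = self.active.pop(proc)
        self.trace.append((proc, "logout", host))

    def hosts_visited(self, proc):
        """Machines the process has run on, in order."""
        return [h for p, event, h in self.trace if p == proc and event == "login"]


def migrate(monitor, proc, new_host):
    monitor.logout(proc)             # before the process is terminated
    # ... checkpoint, terminate, and resume on new_host would happen here ...
    monitor.login(proc, new_host)    # after the process is resumed


monitor = MonitorSketch()
monitor.login("A", "sztaki-node3")
migrate(monitor, "A", "uow-node1")
print(monitor.hosts_visited("A"))
```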
Migration among non-friendly Condor pools (under development)
1. Detection of low resources on cluster
2. Removal of application from the queue
3. Transfer binaries, checkpoint files, work files
4. Submit application to the queue
5. Auto self-recovery of P-GRADE application
[Figure: P-GRADE environment, GRID Application Manager, CONDOR pool A, CONDOR pool B]
It requires consultation with CONDOR developers…
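Since this mechanism is still under development, the following is only a sketch of the intended ordering of the five steps, not an implementation. Pools are modelled as plain Python objects; `PoolSketch`, `migrate_between_pools` and the "free nodes" threshold used to detect low resources are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PoolSketch:
    """Hypothetical model of a Condor pool: free nodes, job queue, staged files."""
    name: str
    free_nodes: int
    queue: list = field(default_factory=list)
    files: dict = field(default_factory=dict)


def migrate_between_pools(app, needed_nodes, src, dst):
    """Apply steps 1-5 in order and return the steps actually performed."""
    done = []
    if src.free_nodes >= needed_nodes:        # step 1: nothing to detect
        return done
    done.append("detect low resources")
    src.queue.remove(app)                     # step 2
    done.append("remove from queue")
    for kind in ("binaries", "checkpoint files", "work files"):   # step 3
        dst.files[(app, kind)] = src.files.pop((app, kind))
    done.append("transfer files")
    dst.queue.append(app)                     # step 4
    done.append("submit to queue")
    done.append("self-recovery")              # step 5: the application restores
    return done                               # itself from its checkpoint files


pool_a = PoolSketch("A", free_nodes=1, queue=["app1"], files={
    ("app1", "binaries"): "bin", ("app1", "checkpoint files"): "ckpt",
    ("app1", "work files"): "work"})
pool_b = PoolSketch("B", free_nodes=8)
performed = migrate_between_pools("app1", needed_nodes=4, src=pool_a, dst=pool_b)
print(performed)
```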
Summary of advantages/disadvantages
• Advantages
– no modification of the grid execution environment is required, since all checkpointing/migration capability is built inside the application
– supports the day-night working mode in the Hungarian ClusterGrid environment
– adaptivity and automation come from Condor
– Condor-PVM applications, with topology of any kind, can now be dynamically migrated like sequential jobs (Note: Condor does not checkpoint PVM applications; only fault-tolerant execution is supported for Master-Worker type applications)
– migrating jobs can be monitored online and visualised
• Limitations
– currently only P-GRADE generated PVM jobs are supported
Conclusion
• A parallel program checkpointing mechanism that can be applied to generic PVM programs.
• A checkpointing mechanism that can be connected to Condor in order to realize migration of PVM jobs among Condor pools.
• By integrating the P-GRADE migration framework and the Mercury Grid monitor, PVM applications can be performance monitored and visualized even during their migration.
• Condor-PVM, through our checkpointing algorithm, is enhanced to checkpoint PVM applications as is done for sequential jobs.
Thank you for your attention!
Jozsef Kovacs <smith@sztaki.hu>
Information about P-GRADE:
pgrade@sztaki.hu
http://www.lpds.sztaki.hu/pgrade
Next release is coming at the end of April…
Information about Hungarian ClusterGrid:
http://www.clustergrid.iif.hu