
DRAFT - 12 Dec 2011

Guide for XSEDE Users of OSG

Overview

The Open Science Grid (OSG) promotes science by (1) enabling a framework of distributed computing and storage resources, (2) making available a set of services and methods that enable better access to ever-increasing computing resources for researchers and communities, and (3) providing principles and software that enable distributed high-throughput computing (DHTC) for users and communities at all scales.

The Open Science Grid does not own the computing, storage, or network resources used by the scientific community. The resources accessible through the OSG are contributed by the community, organized by the OSG, and governed by the OSG Consortium; an overview is available at An Introduction to OSG. Today, the OSG community brings together over 100 sites that provide computation and storage resources; a current view of usage metrics is available at the OSG Usage Display.


OSG supports XSEDE users by providing a “Virtual Cluster” that forms an abstraction layer for accessing the distributed OSG fabric. This interface allows XSEDE users to view the OSG as a single cluster on which to manage their jobs, provide the inputs, and retrieve the outputs. XSEDE users access the OSG via the OSG-XSEDE login host, which appears as a resource in the XSEDE infrastructure.

Computation that is a good match for OSG

High-throughput workflows with simple system and data dependencies are a good fit for OSG. The Condor manual has an overview of high-throughput computing.

Jobs submitted into the OSG Virtual Cluster will be executed on machines at several remote physical clusters. These machines might have small differences in environment and system configuration compared to the submit node. It is therefore important that the jobs are as self-contained as possible, using generic binaries and data that can either be carried with the job or staged on demand. Please consider the following guidelines:
- Software should preferably be single-threaded, use less than 2 GB of memory, and each invocation should run for 4-12 hours. There is some support for jobs with longer run times, more memory, or multi-threaded codes. Please contact the support address listed below for more information about these capabilities.

- Compute sites in the OSG can be configured to use pre-emption, which means jobs can be automatically killed if higher priority jobs enter the system. Pre-empted jobs will restart on another site, but it is important that the jobs can handle multiple restarts.

- Binaries should preferably be statically linked. However, dynamically linked binaries with standard library dependencies, built for a 64-bit Red Hat Enterprise Linux (RHEL) 5 machine, will also work. Also, interpreted languages such as Python or Perl will work as long as there are no special module requirements.

- Input and output data for each job should be < 10 GB to allow them to be pulled in by the jobs, processed, and pushed back to the submit node. Note that the OSG Virtual Cluster does not currently have a global shared file system, so jobs with such dependencies will not work.

- Dependencies on software can be difficult to accommodate unless the software can be staged with the job.

How to get consultation and support for using OSG

XSEDE users of OSG get technical support by contacting OSG User Support staff at email osg-xsede-support@opensciencegrid.org.


System Configuration

The OSG Virtual Cluster is a Condor pool overlay on top of OSG resources. The pool is dynamically sized based on demand (the number of jobs in the queue) and supply (resource availability at the OSG sites). It is expected that the number of resources available to XSEDE users will, on average, be on the order of 1,000 cores.

One important difference between the OSG Virtual Cluster and most of the other XSEDE resources is that the OSG Virtual Cluster does not have a shared file system.

Local storage space at the submission site is controlled by quotas. Your home directory has a quota of 10 GB, and your work directory, /local-scratch/$USER, has a quota of 1 TB.
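
To see how much of your work-directory quota you are currently using, standard Linux tools on the login host should suffice; for example (a sketch, assuming du is available and the directory exists for your account):

du -sh /local-scratch/$USER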

System Access

The OSG Virtual Cluster supports Single Sign On through the XSEDE User Portal, and also from the command line using gsissh with a grid certificate for authentication.

To log in via the XSEDE User Portal, log in to the portal and use the login link on the "accounts" tab there.

To log in via gsissh:

gsissh xsede-login.opensciencegrid.org
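
When logging in with gsissh from your own machine, you need a valid grid proxy for the command to authenticate with. As a sketch (assuming the Globus/MyProxy client tools are installed locally; the MyProxy server name below is an assumption, so check the current XSEDE documentation for the correct host), a proxy can be obtained with:

myproxy-logon -s myproxy.xsede.org -l your_xsede_username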

Application Development

Most of the clusters in OSG are running Red Hat Enterprise Linux (RHEL) 5 or some derivative thereof. For your application to work well in this environment, it is recommended that the application be compiled on a similar system, for example on the OSG Virtual Cluster login system, xsede-login.opensciencegrid.org. It is also recommended that the application be statically linked, or alternatively dynamically linked against just a few standard libraries. Which libraries a binary depends on can be checked using the ldd utility.
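
For example, a small C program could be built and checked on the login host like this (a sketch; the file names are illustrative):

gcc -static -o myapp myapp.c
ldd myapp

For a fully static binary, ldd reports that it is not a dynamic executable; for a dynamically linked binary, it lists the shared libraries that the remote execute machines would need to provide.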


In the case of interpreted languages like Python and Perl, applications have to either use only standard modules, or be able to ship the modules with the jobs. Please note that different compute nodes might have different versions of these tools installed.
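
One common pattern is to bundle pure-Python modules in a tarball that is transferred with the job, and to unpack them in a job wrapper before pointing PYTHONPATH at them. A minimal sketch (mymodules.tar.gz, mymodules, and myscript.py are hypothetical names):

-------------

#!/bin/sh
# unpack the modules that were shipped with the job
tar xzf mymodules.tar.gz
# make the unpacked modules visible to Python
export PYTHONPATH=$PWD/mymodules:$PYTHONPATH
# run the actual analysis script
python myscript.py "$@"

-------------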

Running Your Application

The OSG Virtual Cluster is based on Condor and the Condor manual provides a reference for command line tools.


Submitting a Simple Job

Below is a basic job for the Virtual Cluster.


-------------

universe = vanilla

# requirements is an expression to specify machines that can run jobs
# You should have Arch be at least "INTEL" and "X86_64"
# as most machines in the grid are 64-bit.
requirements = (FileSystemDomain != "") && (Memory >= 1 && OpSys == "LINUX") && (Arch == "X86_64")

executable = /bin/hostname
arguments = -f

should_transfer_files = YES
WhenToTransferOutput = ON_EXIT

output = job.out
error = job.err
log = job.log

notification = NEVER

queue

-------------


Create a file named job.condor, and then run:


condor_submit job.condor
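
Once the job has finished, the output of /bin/hostname (the fully qualified name of the execute machine the job ran on) will be in job.out, which you can inspect with, for example, cat job.out.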






Monitoring Jobs

You can view the queue by running:


condor_q


To limit it to just your jobs:


condor_q $USER
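
If you need to cancel a job, the standard Condor tool for that is condor_rm, using the job or cluster ID shown in the first column of condor_q (the ID below is a placeholder):

condor_rm <cluster_id>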


Data Staging

For data staging you can either use Condor I/O, or globus-url-copy. For Condor I/O, use the transfer_input_files and transfer_output_files attributes in the Condor submit file. For example:

transfer_input_files = input.1, input.2
transfer_output_files = output.tar.gz


Note that the job will fail if not all listed output files exist after running the executable. If you have a lot of small inputs/outputs, you can tar them up using a job wrapper, as in the sketch below.
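
A minimal wrapper sketch for that pattern (assuming the inputs are shipped as inputs.tar.gz via transfer_input_files, the program is called myapp, and it writes its results into a results/ directory; all of these names are illustrative):

-------------

#!/bin/sh
# unpack the inputs that Condor transferred with the job
tar xzf inputs.tar.gz
# run the actual program
./myapp
# bundle all outputs into the single file listed in transfer_output_files
tar czf output.tar.gz results/

-------------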


To use globus-url-copy, you will have to transfer your XSEDE credentials to xsede-login.opensciencegrid.org. You should then be able to run grid-proxy-init/grid-proxy-info:

grid-proxy-init -valid 96:00
grid-proxy-info


Your proxy will automatically be picked up by Condor and shipped with your jobs. You can then use globus-url-copy in the job wrapper to pull in inputs, and push out outputs.
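
A minimal wrapper sketch of that approach (the GridFTP server name and paths below are placeholders, not real endpoints, and myapp is a hypothetical program):

-------------

#!/bin/sh
# pull the input from a GridFTP server using the proxy shipped with the job
globus-url-copy gsiftp://gridftp.example.org/data/input.dat file://$PWD/input.dat
# run the actual program on the staged input
./myapp input.dat output.dat
# push the output back to the GridFTP server
globus-url-copy file://$PWD/output.dat gsiftp://gridftp.example.org/data/output.dat

-------------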

DAGMan and Sample DAGMan Workflow

DAGMan is a Condor workflow tool. It allows the creation of a directed acyclic graph of jobs to be run, and then DAGMan submits and manages the jobs.

DAGMan is also useful if you have a large number of jobs, even if there are no job inter-dependencies, as DAGMan can keep track of failures and provide a restart mechanism if part of the workflow fails.


(Sample workflow will be provided at a later time)
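
In the meantime, a minimal sketch of a two-step DAG (assuming step1.condor and step2.condor are submit files like the one shown above; the names are illustrative):

-------------

# workflow.dag - run job A first, then job B once A has completed successfully
JOB A step1.condor
JOB B step2.condor
PARENT A CHILD B

-------------

Submit it with:

condor_submit_dag workflow.dag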