The ATLAS PanDA Pilot in Operation


The PanDA Pilot in Operation

P. Nilsson for the ATLAS Collaboration

University of Texas at Arlington

Abstract

The ATLAS Production and Distributed Analysis system (PanDA) [1] was designed to meet ATLAS [2] requirements for a data-driven workload management system capable of operating at LHC data processing scale. Submitted jobs are executed on worker nodes by pilot jobs sent to the grid sites by pilot factories. This poster provides an overview of the PanDA pilot [3] system and presents major features added in light of recent operational experience, including multi-job processing, advanced job recovery for jobs with output storage failures, gLExec [4] based identity switching from the generic pilot to the actual user, and other security measures. The PanDA system serves all ATLAS distributed processing and is the primary system for distributed analysis; it is currently used at over 100 sites worldwide. We analyze the performance of the pilot system in processing real LHC data on the OSG [5], EGI [6] and NorduGrid [7] infrastructures used by ATLAS, and describe plans for its evolution.

Section: Introduction

Data analysis using grid resources was one of the fundamental challenges to address before the start of LHC data taking.

ATLAS will produce petabytes of data per year.

Data needs to be distributed to sites worldwide.

More than one thousand users will need to run analyses on this data.

Analysis client tools and grid infrastructure have been mature for several years.

[Illustration: from raw data to user analysis]

The PanDA system is a workload management system developed to meet the needs of ATLAS distributed computing. The pilot-based system has proven to be very successful in managing the ATLAS distributed production requirements, and has been extended to manage distributed analysis on all three ATLAS grids.

[Illustration, PanDA system]

Jobs are submitted to the PanDA server by users and the production system. Pilot factories send jobs containing a thin wrapper to grid computing sites. This wrapper downloads the PanDA pilot code. The pilot then downloads and executes an ATLAS job.

Main tasks of the PanDA pilot:

1) Verifying WN environment, disk space, file sizes, etc.,
2) Setting up runtime environment,
3) Transferring input files,
4) Executing the payload,
5) Transferring output files,
6) Communicating job status to the PanDA server,
7) Cleaning up.

Section: Special Pilot Features



Job and disk monitoring

Output files are monitored for size and for whether they are still being written to. Is there enough local space to finish the job? Is a user job within the allowed limits?



Job recovery

If a job has run for many hours and only fails during stage-out, is it ok to abandon an entire job and waste all the spent CPU time? If a pilot fails to upload the output files for an otherwise completed job, it can optionally leave the output on the local disk for a later pilot to re-attempt the transfer.



Multi-job processing

A single pilot can run several short jobs sequentially until it runs out of time (pre-determined). Useful for reducing the pilot rate when there are many short jobs in the system. It is used with both production and user analysis jobs.
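The multi-job loop is essentially "keep fetching jobs until the budget runs out". A minimal sketch, assuming `get_job` stands in for a dispatcher call that returns `None` when no jobs are queued:

```python
import time

def multi_job_loop(get_job, run, budget_seconds, clock=time.monotonic):
    """Run dispatched jobs sequentially until the pre-determined wall-time
    budget is spent or the dispatcher has no more jobs."""
    deadline = clock() + budget_seconds
    finished = 0
    while clock() < deadline:
        job = get_job()
        if job is None:          # nothing left in the queue
            break
        run(job)
        finished += 1
    return finished
```

One pilot thereby absorbs many short jobs, reducing the pilot submission rate.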



General security measures

All communications with the PanDA server use certificates. During job download, a pilot can be considered legitimate if it presents to the job dispatcher a special time-limited token that was previously generated by the server.
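The poster does not specify the token format; one generic realization of a server-generated, time-limited token is an HMAC signature over the pilot identity and an expiry time, which the dispatcher can verify without a database lookup. Everything below (secret, encoding, field layout) is an assumption for illustration.

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"   # placeholder; held only on the server side

def issue_token(pilot_id, lifetime_s, now=None):
    """Server side: sign (pilot_id, expiry) and hand the token to the pilot."""
    exp = int(time.time() if now is None else now) + lifetime_s
    msg = ("%s:%d" % (pilot_id, exp)).encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return "%s:%d:%s" % (pilot_id, exp, sig)

def token_valid(token, now=None):
    """Dispatcher side: check the signature and that the token has not expired."""
    pilot_id, exp, sig = token.rsplit(":", 2)
    msg = ("%s:%s" % (pilot_id, exp)).encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    now = time.time() if now is None else now
    return hmac.compare_digest(sig, expected) and now < int(exp)
```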



Optional gLExec security

To prevent a user job from attempting to use the pilot credentials (which have many privileges), gLExec can switch the identity to the user prior to job execution. The user job is thus executed with the user's normal, limited credentials.
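Schematically, gLExec is invoked by prefixing the payload command and passing the user's delegated proxy through the environment (the `GLEXEC_CLIENT_CERT` variable). Site configuration varies and the helper names below are invented, so treat this as a sketch rather than the pilot's actual invocation.

```python
import os
import subprocess

def build_glexec_command(payload_cmd, user_proxy_path,
                         glexec_path="/usr/sbin/glexec"):
    """Prefix the payload with glexec and expose the user's proxy so the
    site can map execution to the user's (limited) identity."""
    env = dict(os.environ)
    env["GLEXEC_CLIENT_CERT"] = user_proxy_path
    return [glexec_path] + list(payload_cmd), env

def run_as_user(payload_cmd, user_proxy_path, runner=subprocess.call):
    """Execute the payload under the user identity instead of the pilot's."""
    cmd, env = build_glexec_command(payload_cmd, user_proxy_path)
    return runner(cmd, env=env)
```

The `runner` injection point lets the pilot fall back to direct execution on sites where gLExec is not deployed.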

Section: Performance

PanDA is serving 40-50k+ concurrent production jobs worldwide. At the same time it is also serving user analysis jobs. These jobs, which are chaotic by nature, have recently reached peaks of 27k concurrent running jobs.


[Graphs: Concurrent production and user analysis jobs in the system in 2009-10.]

[Graph: Reported errors during September 2010.]

The error rate in the entire system is at the level of 10 percent. The majority of these errors are site or system related, while the rest are problems with ATLAS software. The PanDA pilot can currently identify about 100 different error types that are reported back to the server as they happen. A web-based PanDA monitor [8] can be used to reach individual jobs and their log files, which is highly useful, especially for debugging problematic jobs.

Section: Evolution

Since the beginning of the PanDA project in 2005, a steady stream of new features and modifications has been added to the PanDA pilot. Partial refactoring of the code has occasionally been necessary, e.g. to allow easy integration of gLExec, alongside an ongoing cleanup and general improvement of older code. Overall, the PanDA pilot has a robust and stable object-oriented structure which will continue to smoothly allow for future improvements and add-ons.

Recent or upcoming new features include:

Plug-ins

The pilot's modular design makes it easy to add new features. E.g. a plug-in for WN resource monitoring using the STOMP framework [9] is currently being developed by people outside the PanDA team.
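A registry pattern is one common way such a plug-in mechanism is realized; the sketch below is purely illustrative and is not the pilot's actual plug-in interface.

```python
# Hypothetical plug-in registry: features register themselves by name, so an
# external contributor can add e.g. a WN resource monitor without touching
# the core pilot code.
PLUGINS = {}

def register(name):
    """Class decorator that makes a plug-in discoverable by name."""
    def wrap(cls):
        PLUGINS[name] = cls
        return cls
    return wrap

@register("wn-monitor")
class WNMonitorPlugin:
    """Stand-in for an externally contributed WN resource-monitoring plug-in."""
    def run(self, workdir):
        return "monitoring %s" % workdir

def get_plugin(name):
    """Instantiate a registered plug-in at run time."""
    return PLUGINS[name]()
```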



CernVM integration

This project, in collaboration with the CernVM team, aims to use CernVM [10] nodes to run ATLAS jobs on volunteer cloud computing resources.




Memory usage monitoring

Next generation job recovery

Extension of pilot release candidate testing framework to HammerCloud [11]

Configuration of pilot using a special file

Migration to Python 2.6

Section: Conclusions/Summary

Section: References

[1] PanDA: https://twiki.cern.ch/twiki/bin/view/Atlas/Panda

[2] The ATLAS Experiment: http://atlas.web.cern.ch/Atlas

[3] The PanDA Pilot: https://twiki.cern.ch/twiki/bin/view/Atlas/PandaPilot

[4] gLExec: https://www.nikhef.nl/pub/projects/grid/gridwiki/index.php/GLExec

[5] Open Science Grid: http://www.opensciencegrid.org/

[6] EGI: http://www.egi.eu/

[7] NorduGrid: http://www.nordugrid.org/

[8] PanDA Monitor: http://panda.cern.ch

[9] STOMP: http://stomp.codehaus.org/

[10] CernVM: http://cernvm.cern.ch/cernvm/

[11] HammerCloud: http://hammercloud.cern.ch/