
Technical Report GriPhyN-2001-xx

www.griphyn.org




DRAFT: COMMENTS AND MATERIAL SOLICITED



GriPhyN Year 2
CMS Project Plan

Draft Version 1
26 November 2001

Developed by members of the GriPhyN Project Team

Submit changes and material to: Mike Wilde, editor
wilde@mcs.anl.gov
1 Overview

This project plan details CMS activities for GriPhyN project year 2, in which we will integrate GriPhyN virtual data research results into CMS simulation production, and begin to experiment with applying virtual data concepts to the CMS analysis problem.

For information on CMS work for Years 3-5, and for how this plan fits into the overall GriPhyN plan and the activities of other GriPhyN experiments, see the GriPhyN Overall Project Plan.

1.1 CMS GriPhyN Goals for Year 2

The high-level goals for year 2 of the project are:

1) Show the utility of GriPhyN technology as a basis for enhancing the robustness and reproducibility of distributed computing, by integrating grid components more deeply into the CMS production software, with the results of the integration being used either in real production or in challenge demos. [UF taking lead, heavy interaction with PPDG, US production, FNAL]

2) Gain and demonstrate the commitment of CMS to the evaluation, testing, integration, and use of GriPhyN technologies.

3) Create a testbed in which both GriPhyN and PPDG CMS activities can be conducted.

4) Forge joint activities and tighter coordination between PPDG, European Data Grid, Teragrid, and GriPhyN (as CMS is part of all four grid projects).

5) Apply GriPhyN virtual data research to CMS simulation production. Deploy the virtual data mechanism and use it to provide automated production as well as a GriPhyN laboratory.

6) Integrate the virtual data catalog into an important production application, and demonstrate the benefits of detailed data derivation tracking and large-scale data re-derivation.

7) Instrument and measure the use of virtual data and request planning mechanisms to gather data and feedback for further CS research.

8) Apply preliminary GriPhyN research results and execution planning and scheduling mechanisms (for example, using DAGMan) to CMS production.

9) Start exploring the CMS analysis process, creating prototypes of GriPhyN-based analysis systems that can lead the way to live science use in project year 3.

10) Create compelling new demos, always available, and runnable via the web, for outreach and for SC.

11) Use CMS activities for education and outreach (move to activity section: by creating case studies; giving talks to emerging (and smaller) science projects; talking to under-funded institutions about how they can leverage technology resources in other grids; offering live unused grid cycles to small institutions, even in a demo context; allowing access to CMS simulation data and grid tools to small physics institutions).

Technical goals: The following are technical proof-of-concept goals for the GriPhyN CS research areas in project year 2:

1) adjust GDMP/GridFTP/RFT interfaces, architectural relationship, and integration for maximum application benefit

2) get a production catalog infrastructure in place (including catalogs for replicas, virtual data, application-level metadata)

3) test a deployment of the replica location service (RLS)

4) permit virtual data requests entered at any location to be executed at many locations

5) deploy a scalable relational database solution for replica and virtual data catalogs

6) understand CMS data model dependency tracking issues

7) explore the use of scripts to generate virtual derived data specifications

8) understand object-level dependency tracking issues

9) create and deploy rudimentary execution planners

10) get rudimentary policy control mechanisms in place (at least for disk space, and to some extent for CPU)

11) deploy effective logging and monitoring mechanisms to perform grid job logging and tracing

1.2 CMS GriPhyN Project Year 2 Activity Overview

The CMS activities for year 2 will consist of testbed development, integration of VDT technology into CMS simulation production, and prototyping the use of VDT technology in the CMS analysis process. These are described in the sections below.

1.2.1 Create a GriPhyN Deployment Testbed for CMS

Several grids will be involved in GriPhyN Y2 CMS activities:

GriPhyN Test Grid: a shared GriPhyN experimental grid to be used by all experiments, for the initial stages of prototyping. This is where the virtual data software and the VDT are first developed. It includes AFS and the ability to swap OSs.

USCMS Production Test Grid: a CMS-only experimental testbed to be shared by GriPhyN and PPDG, to be used by CMS collaborators to prototype mechanisms to run CMS production and distributed analysis.

In the USCMS Grid, sites will take on roles as follows:

Tier 1: FNAL
Tier 2: CIT, UFL, UW
Tier 3: U of Chicago, UCSD

Teragrid and LHC Computing Grid: Eventually both NSF Teragrid machines and LHC Computing Grid machines should be integrated into the USCMS production grid. In PY2 we will lay the project management plans to make this happen. For the Teragrid, at least, this will involve porting and testing efforts to enable CMS production software to run on IA64 architectures and on different versions of Linux beyond Red Hat 6.2.

1.2.2 Integrate GriPhyN Virtual Data technology into the CMS Monte Carlo simulation production process

The main GriPhyN CMS effort for Y2 is to integrate virtual data and request planning mechanisms into the CMS Monte Carlo Simulation Production system (for which current US CMS plans include the production use of tools such as IMPALA, MOP, and BOSS). As part of this effort, we need to bring together several production job management mechanisms into a single one, applying GriPhyN VDT components for tangible benefits to CMS and research feedback to the GriPhyN project. GriPhyN Grid technologies that we intend to integrate into this mechanism include replication, the replica location service, reliable file transfer, the Community Authorization Service, and prototypes of the GriPhyN Job Execution Planner.

The benefits to CMS of this effort are: easier, more automated job submission; easier recalculation of data products; accurate tracking of data derivation; and the ability to re-derive later-stage outputs of the production data pipeline without recomputing the earlier phases, in cases where the later-stage processing programs require changes. We intend to highlight ease of grid usage as a major benefit to CMS, especially in the automated handling of failures and of complex grid configuration and usage details. We also intend to explore how to effectively utilize the GDMP publish-subscribe paradigm within CMS production.

The benefits to GriPhyN of this effort are: live testing of a fundamental GriPhyN concept in intensive production for real users; user feedback on the value of the virtual data paradigm and the usability of the tool set; measurement of virtual data process effectiveness; and capture of live logs of the detailed activity and resource utilization of the production process for further CS research.
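To make the re-derivation benefit concrete, the following is a minimal sketch in Python; it is hypothetical (the file names, program versions, and the stale_outputs helper are illustrative, not part of MOP, IMPALA, or the VDT) and shows only how recorded derivations let a catalog mark later-stage outputs as stale when a late-stage program changes, while leaving earlier phases untouched.

```python
# Hypothetical sketch: derivation records plus selective invalidation.
from dataclasses import dataclass

@dataclass
class Derivation:
    output: str          # logical file produced
    inputs: list         # logical files consumed
    transformation: str  # program that produced the output
    version: str         # program version recorded at production time

catalog = [
    Derivation("events.ntpl", ["generator.cards"], "cmkin", "1.1"),
    Derivation("hits.fz", ["events.ntpl"], "cmsim", "125"),
    Derivation("digis.db", ["hits.fz"], "writeDigis", "4.2"),
]

def stale_outputs(catalog, current_versions):
    """Outputs whose producing program changed, plus everything derived from them."""
    stale = {d.output for d in catalog
             if current_versions.get(d.transformation) != d.version}
    changed = True
    while changed:  # propagate staleness to downstream derivations
        changed = False
        for d in catalog:
            if d.output not in stale and any(i in stale for i in d.inputs):
                stale.add(d.output)
                changed = True
    return stale

# Only writeDigis changed: events.ntpl and hits.fz stay valid, digis.db is re-derived.
print(stale_outputs(catalog, {"cmkin": "1.1", "cmsim": "125", "writeDigis": "4.3"}))
```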

1.2.3 Prototype the application of GriPhyN Virtual Data technology to the CMS analysis process

In this activity we will begin to explore the vital later-phase CMS process of "analysis": the combing of massive numbers of events for the signature patterns that offer supporting evidence of the various theories of nature being studied. Once CMS is taking live data from the detector, analysis activities will be the dominant use of computing resources.

In a typical analysis, a physicist would select events from a large TAG database, gather the full event from various sources, and create (reconstruct) those events not already existing on some "convenient" storage system. The process would work something like this (a rough sketch in code follows the list):

1) Search through 10^9 TAG events (~1 TB), select 10^6 events and get the full event data for them (~2 TB)

2) Do a "new improved" reconstruction pass, make a new set of analysis objects (~100 GB) and TAGs (~1 GB)

3) Analyze the new AOD and TAG datasets interactively and extract a few hundred signal events.

4) Histogram the results and visualize some of the "interesting" events.
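A toy sketch of these four steps, in hypothetical Python with tiny in-memory stand-ins for the TAG database and event store (all names, fields, and cut values are invented for illustration and are not CMS software), is shown below.

```python
# Illustrative walk-through of the four analysis steps above; everything is a toy.
import random
random.seed(1)

# Stand-in TAG database: (event_id, missing_et) pairs; the real one holds ~10^9 events (~1 TB).
tag_db = [(i, random.uniform(0.0, 200.0)) for i in range(100_000)]
event_store = {i: {"id": i, "raw": random.random()} for i, _ in tag_db}

# 1) Select events with a TAG-level cut and gather their full event data (~2 TB in reality).
selected = [i for i, met in tag_db if met > 150.0]
full_events = [event_store[i] for i in selected]

# 2) "New improved" reconstruction pass -> a new set of analysis objects (AOD) and TAGs.
new_aod = [{"id": e["id"], "mass": 90.0 + 20.0 * e["raw"]} for e in full_events]
new_tags = [{"id": a["id"], "mass": a["mass"]} for a in new_aod]

# 3) Interactive analysis of the new AOD/TAG sets: keep a few hundred signal candidates.
signal = [a for a in new_aod if 100.0 < a["mass"] < 100.3]

# 4) Histogram the result; the "interesting" events would then be visualized.
histogram = {}
for s in signal:
    histogram[round(s["mass"], 1)] = histogram.get(round(s["mass"], 1), 0) + 1
print(len(selected), "events selected from TAGs;", len(signal), "signal candidates")
```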

We seek to create virtual data techniques to track data dependencies for the files and/or objects in this process, from tag schemas and tag databases or tables back to the reconstructed event sets and possibly back to the raw data.

We will build tools for this type of user-driven, fine-granularity physics dataset extraction and transport over the grid, driven by an easy-to-understand interface that reduces the difficulties of marshalling distributed grid resources. This effort will exploit newer collection management features developed by the COBRA/CARF team at CERN.

We expect to work further on exploring the impact of end-user physics analysis workloads on the grid system, by prototyping distributed end-user analysis tools, demos, and pilot facilities.

1.3 The GriPhyN CMS Team

Name                       Affiliation             Role
James Amundson             PPDG FNAL CMS           Physicist; MOP Developer
Lothar Bauerdick           GriPhyN FNAL CMS        US CMS Coordinator
Dimitri Bourilkov          GriPhyN UFL CMS         CMS Physicist
Rick Cavanaugh             GriPhyN UFL CMS         CMS Physicist; GriPhyN project mgmt
Greg Graham                PPDG FNAL CMS           CMS Physicist; IMPALA Developer
Koen Holtman               GriPhyN CIT CMS         CMS Computer Scientist
(Iosif Legrand - not?)     PPDG CIT-CERN-CMS       CMS Physicist
(Vladimir Litvin - ?)      GriPhyN CIT-CMS         CMS Physicist
Harvey Newman              GriPhyN PPDG CIT-CMS    CMS Physicist; GriPhyN lead
Rajesh Rajamani            Condor UW               CMS App Support
Jorge Rodriguez            GriPhyN UFL-CMS         CMS Physicist
(Conrad Steenberg - PPDG)  PPDG CIT-CMS            CMS Physicist
Jens Voeckler              GriPhyN UC-CMS          VDC / VDT development

Injection of CS research topics into CMS plan here: sequence, when, where, how? Or move some of this explanation to the master plan.

1.4 Project Year 2 Timetable Overview

This section presents a high-level overview of GriPhyN CMS activities and milestones for project year 2. Full details are contained in the associated Microsoft Project plan document.

Testbed timetable

Q1: Jan-Mar

Milestone: VDT 1.0 Release
VDT 1.0 installation on GriPhyN Testgrid machines
Develop testgrid certification tests
Verify testgrid functionality
Milestone: Test Grid Ready
Develop USCMS Grid Plan V1 (for Q1-Q2 2002)
Circulate plan for feedback
Milestone: USCMS Grid Plan V1 Approved
USCMS Grid machines allocated and/or purchased
VDT 1.0 installed on USCMS Grid machines
Verify USCMS Grid V1 functionality
USCMS Grid V1 ready
Milestone: VDT 2.0 Release (includes first VDC/VDL)
Merge ATLAS, LIGO and Sloan machines into GriPhyN Test Grid (tentative)
Production Plan for USCMS Grid production schedule drafted and circulated

Q2: Apr-Jun

Milestone: Production Plan for USCMS Grid production schedule approved by US-CMS management.
Upgrade Testgrid to VDT 2.0
Upgrade USCMS Grid to VDT 2.0
Capacity increase for USCMS Grid
Production Software V1 installed on USCMS Grid
Production Software V1 testing

Q3: Jul-Sep

CMS testbed upgraded to VDT 1.0
Progrid established
Increase automation functionality
USCMS testbed plan v1
USCMS: n machines at UFL
USCMS: n machines at UW
USCMS: n machines at FNAL, CIT, etc.
USCMS: MOP app suite checkout (certification) test
USCMS: MOP in use by UW physics


CMS Simulation Production Timetable

Q1: Jan-Mar

Production Software Test Plan drafted
Production Software

Q2: Apr-Jun

Q3: Jul-Sep

VDL 1 design document
IMPALA & MOP design documents
MOP-IMPALA-VDT Design meeting I
Design for convergence I
VMOP-VIMPALA Design meeting II
Design for convergence II
Virtual Data Catalog integration into CMS simulation production framework (MOP)
- dovetail with CMS production schedule (an agreement and planning document)
- in use at UW, then UF, then FNAL, then other T1 sites
- available for more ad-hoc simulations (user-requested; other research groups, e.g. UW, UFL)
- want to have a major production with DAGMan
Vladimir: validate MOP results
Integrate with T2 WBS plans
CMS project review checkpoints
CMS certification tests
CMS design approval
Design meetings
Determining role of IMPALA
Publishing design
Making MOP changes and integration
Friendly user testing
User I/F design
Database selection
Database deployment
Document: detail all logging data that will be recorded
Get logging and measurement plan in place
CS research activity: perform data regeneration tests; measure speed and accuracy


Analysis Prototyping Milestones

Q1: Jan-Mar

No activity planned for this quarter

Q2: Apr-Jun

Q3: Jul-Sep

Design of data dependency model (based on Koen's papers)
Integration of a VDC into a Clarens prototype
GTR for grid requirements for user analysis


Advanced Planner and Policy Functionality - Prototyping Milestones

Q1: Jan-Mar

No activity planned for this quarter

Q2: Apr-Jun

Q3: Jul-Sep

CAS in place in TestGrid
CPU sharing policy prototype
Storage sharing policy prototype
Refined DAGMan language for end-user job submission

2 Simulation Production Challenge Problem Specification

Revise this section in terms of feature sets / separate feature sets from challenge problems

CMS: Virtualization of Monte Carlo production in the MOP high-throughput framework. Actual production, at least at Fermilab, could use the virtualized MOP framework. Other sites will be able to both replicate and materialize data products produced under Fermilab control.

Sites will be able to replicate the early (initial) files in the simulation pipeline and materialize the final files in the pipeline.

Later in the project year, an automated planner could make decisions about where to execute simulation runs.

OI: what logic/features/functions are needed from each of MOP, IMPALA, and VDC

2.1 Data derivation tracking

Experiments / demonstration (GriPhyN-only) in data reproducibility (but not for production use)
Fitting features of MOP into a DGA architecture and making the solution usable outside GriPhyN.
Request estimates stored in VDC?
See if virtual data generator functions can be used in MOP

2.2 Handling both File and OO data

Event sets?
Location of tracked data stores

Since the future of Objectivity in the CMS collaboration is uncertain, the early versions of the VDC will do only rudimentary data dependency tracking of Objectivity data, most likely in the form of tracking and treating Objectivity databases as ordinary files with some pre- and post-processing requirements (such as detaching and attaching of database files from their parent federations).

Later versions may explore more sophisticated forms of object re-clustering into new database files, and/or finer-grained object dependency tracking. We plan to explore (at least from a design perspective) the tracking of these dependencies down to the level of single raw or simulated events.

2.3 Data Management

Staging and replica management; GDMP and RFT integration
Determine push and/or pull model for data tracking
Take advantage of replicas
File replication service.
Replica catalog / VDC coherence and integrity
Find the place for replica cataloging: first cut - distributed Oracle (T0, T1, T2 each have a catalog; uses Oracle replication) CERN, FNAL
Create some degree of metadata cataloging: RAW, ESD, AOD, TAG
(needs RLS in VDT layer this year or in PY3)
Multi-site cached file service
Do intelligent staging
Later in year - integrate RFT/RFR; revised GDMP

2.4 Job Management

Create a user request interface to the grid - maybe this is Condor-G submit
Integration and development of a work planner
Documentation of the job request process and data request process.
Heterogeneous resource schedulers? (Condor, PBS, etc.?)
Unification (or co-existence) of ClassAds and RSL, and how they will be used
Generalization of request planning
Request estimation? (at both file and job level? In this system, what's the outermost unit of request: files or jobs?)
See if execution sites could be picked with some degree of automation in MOP - make MOP requests execution-site independent.
Explicit request planning - picking site of execution; data moved automatically (estimated automatically)
Automated request planning - automatically pick site of execution - tests, 4Q.
Interaction with site resource policy - CAS - get Catalin involved - broker use of resources without human intervention (task: design of policy language and interpreter)
Handle executable management - transporting, tracking, and execution dependencies
Interface to/for executable signature tracking; executable automated building (move this part to a later PY?)
Clean failure handling and job restart; not FT but a step toward high integrity and ease of use.
Further R&D on fault tolerance and scalability, centered around the RES job execution service. [CIT team]
Work with the Condor team to develop DAGMan further, in particular its expressiveness in terms of error recovery, with the goal of applying it to the CMS production system. [UF taking lead]

2.5 Policy

? For CPU, disk

2.6 Monitoring and Telemetry

Let a user know job, queue, and system status easily
Return info to diagnose job problems (steps towards grid-wide reporting and exception handling)
Grid-coordinated time
Monitoring - for data reporting, fault detection, resource utilization - esp. disk space
Work on monitoring software [CIT team led by Iosif Legrand at CERN; also work at UFL]
Deploy some results from the PG monitoring group

2.7 Policy

Disk space allocation policy; CPU allocation policy
Distinguish between global, local, and individual requests

2.8 Quality Assurance

CMS specifications for production data integrity, code integrity, testing, etc.
Test Plans
Certification Process
Identification of CMS Standards

--------

Resolve vdata cataloging and file naming issues

3 CMS Analysis Challenge Problem Specification

Work further on exploring the impact of end-user physics analysis workloads on the grid system, by prototyping distributed end-user analysis tools, demos, and pilot facilities which allow end-user physicists without specific grid training to accomplish basic physics data manipulation tasks using the grid.

Show user collection creation and transport over the grid, driven by an easy-to-understand grid interface. [CIT taking lead]


Tracking the data objects of analysis
Scheduling analysis resources
Remote analysis
Tracking tag database schema evolution
Clustering of cut sets
Regeneration of: reconstruction, ESD, AOD, TAG.

Explore database issues:
- multi-database heterogeneity
- table naming and version mgmt
- schema evolution and version management
- derivation issues for self-describing data (i.e., tables with a header row)
- tracking the evolution of schemas as an object

Exploring architecture and component interaction issues:
- propagation and update of tag databases
- reclustering
- distributing a more detailed analysis over the smallest/best set of sites where the necessary data lives or could be moved to
- going all the way back to reconstruction

4 Challenge-problem Solution Development Activities

This section describes activities. Section 5, below, presents a timetable and responsibilities for the activities.

4.1 Experiment Knowledge-base Building and Documentation Activities

Analysis of value of reproducibility in terms of data replication

Develop several revisions of:
Analysis purpose and general model - refer to main GriPhyN plan
Data and Application Map
Data dictionary
Tool dictionary
Data requirements spreadsheet
State model for the application: Koen's docs, GriPhyN reqs doc; Monarc; Conrad's remote analysis work

Develop the data flow model for CMS production w/w
Develop data dependency model
- incorporate both raw and simulated flows
- incorporate reconstruction (is writeDigis the only current reconstruction tool?)
Enhance CMS web page
Explore where reconstruction needs to happen (Vladimir: after initial phase of writeDigis)

Work with the Globus team and the EU DataGrid to develop a set of long-term scalability requirements for the file replica catalog service.

Do further work on the issue of reconciling the object and object-collection nature of the CMS data model with the file nature of the low-level data grid services.

Actively participate in the development of a Grid architecture by reviewing architectural documents created in the Grid projects, and by communicating architectural lessons learned in CMS production to the Grid projects.

Publication of CMS replication requirements and the dataflow behind it; use as a model for the other experiments
Hook into CMS paradigms for module signature/identification
Build consensus/support for CMS adoption of GMOP
Ensure Clarens is the best platform for building an analysis framework

Refinement of this plan:
GMOP external specification document
GMOP detailed design document (GriPhyN CMS document; meets CMS specification for official project software)
Analyze the parts of MOP that perform GriPhyN-like operations (for example, figuring out what sequence of operations needs to be performed).

4.2 Development Activities

The primary focus of CMS activities in GriPhyN during Year 2 will be the development of virtual data tools for the production of Monte Carlo simulated CMS data and virtual data tools for the analysis of CMS data. These software packages will both rely upon existing tools from the current Virtual Data Toolkit and contribute new tools, where these are general enough in nature, to future versions of the Virtual Data Toolkit.

In support of these activities, efforts will be directed towards the establishment of a catalog infrastructure including: Replica Catalogs (RC), Virtual Data Catalogs (VDC), and Meta-Data Catalogs (MD). Significant progress towards the development of a prototype VDC has already been accomplished by Voeckler using a PostgreSQL database coupled with a Perl interface. The catalog tracks the dependencies of data files and the transformations between data files. As such, it is able to regenerate any (missing or deleted) data file on demand.
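The regeneration behavior can be illustrated with a small sketch. This is hypothetical Python, not the actual PostgreSQL/Perl prototype; the file names and trivial shell commands are placeholders standing in for CMS transformations.

```python
# Minimal sketch of on-demand regeneration from recorded derivations (illustrative only).
import os
import subprocess

# Each derivation: output file <- (shell command template, input files).
derivations = {
    "events.ntpl": ("echo kinematics > {out}", []),
    "hits.fz":     ("cat {ins} > {out}", ["events.ntpl"]),
    "digis.db":    ("cat {ins} > {out}", ["hits.fz"]),
}

def materialize(lfn):
    """Recreate a missing file, first recreating any of its inputs that are missing."""
    if os.path.exists(lfn):
        return lfn                        # already materialized locally
    cmd_tmpl, inputs = derivations[lfn]   # a KeyError would mean: raw data, not derivable
    for inp in inputs:
        materialize(inp)                  # walk up the recorded derivation chain
    subprocess.run(cmd_tmpl.format(out=lfn, ins=" ".join(inputs)),
                   shell=True, check=True)
    return lfn

materialize("digis.db")   # regenerates events.ntpl and hits.fz first if they are absent
```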

While CMS does not currently use virtual data concepts, CMS has detailed several future needs related to virtual data [GRIPHYN 2001-16]. Using the experience gained by integrating the VDC with current CMS production and analysis needs (see below), CMS-GriPhyN plans to work with CS-GriPhyN to further develop virtual data concepts by taking on the following work items during Year 2 (a small sketch of Work Items 1 and 2 follows this list):

- Work Item 1: Develop and prototype the concept of "grid uploaded" files and algorithms (i.e. transformations between data files). Such files and transformations would exist in a grid-wide database ("replica catalog?") and be distinguished by Unique Identifiers (UIDs).

- Work Item 2: Interface the current VDC with the prototype "replica catalog" so that virtual data products can be materialized from "grid uploaded" files and transformations. Each materialized virtual data product would receive a UID and an entry in the "replica catalog" and/or the VDC. Platform dependencies and their relation to virtual data product UIDs will be investigated.

- Work Item 3: Develop prototypes for Consistency Management of the "replica catalog" over a grid. (Open issue: Should we rely on the job to fail and provide "failure" exit codes if the expected data does not exist, thereby updating the replica catalog, or should we rely on a "consistent" replica catalog, or a combination of both?)

Work item: Some amount of intelligence for dealing with Objectivity object sharing of simulation results will be required to place this work into production.
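As one possible reading of Work Items 1 and 2, the sketch below is hypothetical Python; the catalog layout and the content-hash UID scheme are assumptions made for illustration, not a GriPhyN design. It assigns UIDs to uploaded files and transformations and records each materialized product against the UIDs it was derived from.

```python
# Hypothetical sketch for Work Items 1-2: UIDs for "grid uploaded" files and
# transformations, plus catalog entries for materialized virtual data products.
import hashlib
import json

replica_catalog = {}   # uid -> {"kind": ..., "locations": [...]}
vdc = {}               # product uid -> {"transformation": uid, "inputs": [uids]}

def uid_for(payload: bytes) -> str:
    """A simple content-derived unique identifier (an assumption, not a standard)."""
    return hashlib.sha1(payload).hexdigest()[:16]

def upload(kind: str, payload: bytes, location: str) -> str:
    """Register a 'grid uploaded' file or transformation under its UID (Work Item 1)."""
    uid = uid_for(payload)
    entry = replica_catalog.setdefault(uid, {"kind": kind, "locations": []})
    entry["locations"].append(location)
    return uid

def register_product(transformation_uid: str, input_uids: list,
                     payload: bytes, location: str) -> str:
    """Record a materialized product and how it was derived (Work Item 2)."""
    uid = upload("product", payload, location)
    vdc[uid] = {"transformation": transformation_uid, "inputs": input_uids}
    return uid

f_uid = upload("file", b"generator cards", "gsiftp://site-a/cards")
t_uid = upload("transformation", b"simulation wrapper script", "gsiftp://site-a/bin/wrap")
p_uid = register_product(t_uid, [f_uid], b"simulated hits", "gsiftp://site-b/hits")
print(json.dumps({"replica_catalog": replica_catalog, "vdc": vdc}, indent=2))
```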

Groups at Caltech and Florida will investigate the integration of monitoring and logging.

4.2.1 Production of Monte Carlo Simulated Data

A tool for the distributed production of CMS Monte Carlo simulated data, known as MOP, is currently under development within the Particle Physics Data Grid. MOP (which is based upon Globus, Condor-G, and DAGMan) is loosely integrated with a set of shell scripts, known as IMPALA, that are currently used by CMS for Monte Carlo production, as well as with the Grid Data Movement Package, or GDMP. Currently these tools do not employ virtual data concepts.

Over the course of Year 2, CMS-GriPhyN will augment MOP and IMPALA with virtual data tracking using the Virtual Data Catalog. This will involve decomposing the job submission and bookkeeping logic of IMPALA into parameters and transformations which are specific to CMS, and logic which is more general to batch job execution planning in the form of abstract Directed Acyclic Graphs (DAGs). In addition, the distributed job execution logic of MOP will be embedded into the VDC to facilitate virtual data materialization in a grid environment. In order to fine-tune these concepts and synchronize with CMS production efforts, two challenge problems are proposed over the next year:

- Challenge Problem 1: Produce 50,000 Monte Carlo fully simulated CMS events (including pileup) using the VDC on a USCMS Grid Testbed (see below). This should expose any technical and conceptual modifications required to use the VDC in realistic situations.

- Challenge Problem 2: Fulfill one (or several) official Monte Carlo Production request(s) from CERN on a USCMS Grid Testbed. This will demonstrate the feasibility of using the VDC in "real world" CMS production activities.

The aim of this effort is two-fold: 1) provide an ever more autonomous environment for CMS Monte Carlo production by enabling automatic error recovery, rigorous bookkeeping, and transparent production at different CMS grid sites, and 2) provide valuable insight into virtual data concepts for future prototyping.
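To illustrate the decomposition described above, the sketch below (hypothetical Python, not IMPALA, MOP, or DAGMan code) derives an abstract DAG, i.e. an execution order in which every input is produced before it is consumed, from the same kind of dependency records the VDC would hold. The job names follow the CMS tools named in this plan; the file names and structure are otherwise illustrative.

```python
# Hypothetical sketch: turn dependency records into an abstract DAG (an ordering of jobs).

# job name -> (logical files it needs, logical files it produces)
jobs = {
    "cmkin":      ([],              ["events.ntpl"]),
    "cmsim":      (["events.ntpl"], ["hits.fz"]),
    "writeDigis": (["hits.fz"],     ["digis.db"]),
}

def abstract_dag(jobs):
    """Order jobs so that every input is produced before the job that consumes it."""
    produced, ordered, remaining = set(), [], dict(jobs)
    while remaining:
        ready = [j for j, (ins, _) in remaining.items()
                 if all(i in produced for i in ins)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for j in ready:
            ordered.append(j)
            produced.update(remaining.pop(j)[1])
    return ordered

print(abstract_dag(jobs))   # ['cmkin', 'cmsim', 'writeDigis']
```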

4.2.2 Remote Data Analysis

Virtual data as it applies to scientific analysis of CMS data has only recently been considered. It is currently unclear whether CMS physicists will employ a single monolithic analysis tool, use a standardized set of analysis tools, or even use different sets of analysis tools. As a result, CMS research into different data analysis paradigms will continue to be monitored by CMS-GriPhyN throughout Year 2.

Given that CMS requires that physicists have the option of defining their own sets of data products (files, objects, etc.) for scientific analysis, it is important to begin the process of tracking virtual data products as they relate to end-user data analysis. In order to probe this and to facilitate whatever analysis tool(s) may be used in the future by CMS, a remote data server (known as Clarens) is being developed by Steenberg to enable analysis of CMS data distributed over a wide-area network. Clarens is based on a Client/Server approach and provides a framework for remote data analysis. Communication between the client and server is conducted via XML-RPC. The server is implemented in C++ and linked to the standard CMS analysis C++ libraries. The client can be end-user specific, and implementations currently exist for several data analysis tools including: C++, PHP, the Java Analysis Studio, and a Python plug-in for SciGraphica. This allows the user full access to remote CMS data via a choice of analysis packages.
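Because Clarens communicates over XML-RPC, a client in any language with an XML-RPC library can reach it. The fragment below is a hedged Python sketch: the server URL and method names are invented placeholders, not the actual Clarens interface.

```python
# Hypothetical Clarens-style client sketch. XML-RPC is the documented transport;
# the endpoint URL and the method names below are placeholders, not the real API.
import xmlrpc.client

# Creating the proxy does not contact the server; the calls are left commented out
# precisely because their names are assumptions.
server = xmlrpc.client.ServerProxy("http://clarens.example.org:8080/clarens")

# session  = server.authenticate(user_certificate)       # placeholder call
# datasets = server.list_datasets("newtags_2002")        # placeholder call
# events   = server.get_events("newtags_2002", 0, 1000)  # placeholder call
```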

During Year 2, Steenberg plans to grid-enable Clarens by taking full advantage of the Virtual Data Toolkit, including: the Virtual Data Catalog for tracking CMS data, the Globus Security Infrastructure for authentication, and GridFTP for CMS data movement:

- Work Item 4: Integrate Clarens with VDT 1.0.

- Work Item 5: Integrate Clarens with the VDC.

As a joint endeavor with efforts in the virtual data tracking of CMS Monte Carlo production, the following data challenge problem is proposed:

- Challenge Problem 3: Remotely analyze 50,000 events using Clarens integrated with the VDC as used in Challenge Problem 1.

Finally, investigations into more fine-grained data collections at the object level and their relation to a VDC will also be done during Year 2.

4.3 Infrastructure development and deployment activities: a USCMS Testbed

CMS-GriPhyN is currently building a USCMS Testbed, in cooperation with the Particle Physics Data Grid, at five initial sites: the California Institute of Technology, the University of Florida, Fermi National Accelerator Laboratory, the University of California-San Diego, and the University of Wisconsin-Madison. To ensure interoperability, the testbed will be based upon the GriPhyN Virtual Data Toolkit Version 1.0, which includes Condor 6.3.1 and Globus 2.0.

The initial goals of the testbed will be to produce a platform which facilitates grid-enabled CMS software development and which probes policy issues related to User-ID management and Certificate Authorities. As the USCMS Testbed matures, integration with the US-ATLAS testbed is envisaged, followed by integration with the iVDGL.

4.3.1 Software Components

As a baseline, the testbed software will consist of the Virtual Data Toolkit and Objectivity/DB. The entire suite of CMS software is complex, with many external software package dependencies, and is dynamically linked to shared object libraries. Hence, the CMS software will initially be fully installed at each grid site, consisting of common versions of: the CERN-patched libraries for the GNU C++ compiler, Anaphe, CMKIN, CMSIM, OSCAR, CARF, COBRA, and ORCA. However, to facilitate sophisticated CMS software development, dynamic installation of versioned (or personalized) CMS software will be investigated and implemented via DAR (a "smart" tarball) as provided by Fermilab.

4.3.1.1 Storage and Handling of Data

Initially, storage and data handling will be performed in an ad hoc way. However, as the testbed matures, the Storage Resource Broker from SDSC will be investigated as a possibility for managing the storage of production data at each site.

Open issue: how would this interact with GDMP?

4.3.1.2 Data Replication and Virtual Data

The Grid Data Movement Package, via Globus, will maintain a Replica Catalog for movement of Monte Carlo production data.

Integrate MCAT into the environment (experimentally? - has connections to BaBar work)

SQL-based RC, VDC, and MCAT in the same glue (Oracle, MySQL?)

4.3.2 Platform Differences

Currently, the CMS software environment supports the Red Hat 6.1, Red Hat 6.2, and Solaris operating systems. To ensure compatibility with the CMS software environment as well as with VDT 1.0, the testbed will be entirely composed of machines running Red Hat 6.2.

4.3.3 Security and Resource Sharing Policies

Each CMS-GriPhyN and CMS-PPDG registered user will receive an account at each of the five grid sites. Initially, the testbed will use the Globus Certificate Authority for the distribution of user and gatekeeper certificates. However, as ESnet is expected to provide a Certificate Authority in early 2002, it is envisaged that the USCMS testbed will migrate to the ESnet Certificate Authority.

It is expected that CMS and the iVDGL will set resource sharing and accounting procedures. In the absence of such policies, resource sharing and accounting procedures will be studied and implemented on a case-by-case basis as needed.

4.3.4 Integration with other Grid Projects

[CIT, UF] Build basic services at 1-2 prototype Tier 2 centers. [Need more info: what are "basic services"? What are the prototype centers? How does this relate to the testbed?]

Use by other GriPhyN projects?

Ruth: Are you looking to establish a reference GriPhyN testbed platform?

MW: Yes, I am interested first in a common GriPhyN testbed for GriPhyN research and challenge-problem demonstration, and second for experiment science usage. It's clear to me that the former needs to be built from the VDT. It's not clear to me to what extent GriPhyN can control or influence the latter.

Ruth: For CMS I believe FNAL should appear in the infrastructure development and deployment bullets.

iVDGL and JTB connections? Connecting the 2 SC grids;

CA issues? (a la JTB?)

5 Detailed Timetable

Review Changes / Integrate with Rick

6 Dependencies, Risks, Contingencies

CMS project commitment to use GriPhyN-enhanced MOP in real production.
Need expertise in CMS apps
Need information/docs on CMS data file formats and object structures; app man pages
Risk: can't convince CMS to let us insert code into its production framework
Risk of EUDG divergence
Risk: stability of the VDC
Mitigation: intensive VDC test plan
Need decision by CMS on OO framework
Requires VDT 2.0
Need people who understand the CMS apps - porting, "harnessing", and deployment over the testbed; static builds and related issues
Need willingness of MOP team to integrate
(Koen) ...MOP, PPDG, testbeds
(Koen) ...could construct some dependencies here based on CMS goals above...
Explore fluid OS re-deployment within a cluster.
App porting
Uncertainty of Objy future; likelihood of transition to ROOT

7 Open Issues

Where reconstruction happens; what type of file it is captured in.

Effects of pileup on data dependency tracking model

Old statement: does this have any more relevance?: [CIT, UF] Complete High Level Trigger milestones and perform studies with ORCA, the CMS object-oriented reconstruction and analysis software. [More details! How does this use VDT services? Need to make clear how this relates to GriPhyN.] MW: is this software simulation of the hardware HLT? If so, how does it relate to the MC production that's part of MOP? Same, similar, or very different?

How will GriPhyN testbeds be structured?
- can multiple experiments use the same testbed, or will each have its own?
- can we create a single shared GriPhyN research testbed, separate from the 1 to 4 production testbeds?

Do we need Objy virtual data tracking mechanisms, or will tracking at the level of a database file suffice for now?

8 Appendix: Intra-Project Technology Transfer

From CMS to other experiments
From other experiments to CMS
Show how the CMS MOP GriPhyN app and/or architecture could be retrofitted to ATLAS and other experiments.
Xfer to/from PPDG, EUDG?
Integration of GriPhyN results into Globus?

9 Items to move back into the original plan

RFT, RFT<->GDMP
CAS
FT
Network use allocation driven by policy? (Beyond Y2... maybe network aware but not network QoS controlling)
Integration with tertiary storage? (Y3)

Place in the overall project:
Y4 - advanced planning and resource management; fault tolerance; resource sharing based on policy
Y5 - adjustments, tuning, more into production

Demonstrate the robustness gains of the use of DAGMan in a realistic CMS production setup by doing a challenge demo which includes at least 3 sites. In this demo, system crashes will be injected at certain times to show the capability of the system to auto-recover without human intervention.

For EAC: show how research seeds are planted and the fruits grafted into the project in later years; e.g., develop FT ideas now, integrate in project Y3
Demonstrate an effective mechanism for fault recovery

Plant hooks for policy now; enhance the policy language and decision making through the planner in later years.

Research topic: explore notions of SLAs in a grid environment

---------

9.1.1 Connect GriPhyN to Other Projects

In support of the above activities, and to give the results the maximum value to the CMS collaboration, we will need to coordinate GriPhyN efforts with those of other projects, in particular iVDGL, PPDG and the EU Datagrid (EUDG). These coordination points include the following:

- The USCMS testbed will be shared between PPDG and GriPhyN.
- GriPhyN will integrate PPDG technologies (such as GDMP) into its VDT.
- We will seek to achieve GriPhyN-EUDG-PPDG architectural consensus.
- We will define and execute a relationship to iVDGL, and integrate with an initial iGOC.
- Use the logging mechanism from the joint PPDG-GriPhyN analysis-design-development process.

In addition, the massive computational resources of the NSF-sponsored Distributed Teragrid Facility will be of great potential value to the CMS collaboration, if plans can be executed which make CMS tools execute reliably, accurately, and easily on the DTF. These plans will need to address:

- data transport issues
- grid security (GSI and certification authority) issues
- application portability and certification issues
- Service Level Agreements regarding resource availability and application performance
- exploring new CMS computing paradigms to exploit the DTF architecture

9.1.2 Education and Outreach

Create an always-available web-based demo
Create high-level and highly visual project summary material
Demonstrate and publish results
Make the CMS simulation and analysis prototypes available to non-collaborating institutions?

Specific CMS E&O activities include:



9.1.3 Prototype the next advancements in VDT technology

Job Language specifying constraints on resources needed and locations of execution and storage
CAS
Basic planner capable of translating policy and job specifications into an execution plan, with a policy language suitable for sharing CMS resources between different experiment groups
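A toy illustration of what such a job language and policy-aware planner could look like follows. This is hypothetical Python; the field names, site policies, and matching rule are assumptions made for the sketch, not a proposed VDT interface.

```python
# Toy sketch: match a job specification against per-site policy and resources
# to produce an execution plan. Everything here is illustrative.

sites = {
    "fnal": {"cpus_free": 64, "disk_free_gb": 500, "groups": {"uscms", "griphyn"}},
    "ufl":  {"cpus_free": 16, "disk_free_gb": 200, "groups": {"uscms"}},
    "cit":  {"cpus_free": 8,  "disk_free_gb": 50,  "groups": {"griphyn"}},
}
job = {"group": "griphyn", "cpus": 10, "disk_gb": 100, "store_output_at": "fnal"}

def plan(job, sites):
    """Pick the first site whose policy admits the job's group and whose resources fit."""
    for name, s in sites.items():
        if (job["group"] in s["groups"]
                and s["cpus_free"] >= job["cpus"]
                and s["disk_free_gb"] >= job["disk_gb"]):
            return {"execute_at": name, "stage_output_to": job["store_output_at"]}
    return None   # no site satisfies both policy and resource constraints

print(plan(job, sites))   # {'execute_at': 'fnal', 'stage_output_to': 'fnal'}
```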


---------