A TeraGrid Extension: Bridging to XD



Submitted to the National Science Foundation
as an invited proposal.



Principal Investigator

Ian Foster
Director, Computation Institute
University of Chicago
5640 S. Ellis Ave, Room 405
Chicago, IL 60637
Tel: (630) 252-4619
Email: foster@mcs.anl.gov


Senior Personnel

Phil Andrews, Project Director, NICS
Rich Loft, Director of Technology Development, CISL/NCAR
Jay Boisseau, Director, TACC/UT Austin
Ralph Roskies, Co-Scientific Director, PSC
John Cobb, ???, ORNL
Carol X. Song, Senior Research Scientist, RCAC/Purdue
Mathew Heinzel, Deputy Director, TeraGrid GIG
Craig Stewart, Associate Dean, Research Technologies, Indiana
Nick Karonis, ???, NIU
John Towns, Director, Persistent Infrastructure, NCSA/Illinois
Daniel S. Katz, Director for Cyberinfrastructure Development, CCT/LSU
Nancy Wilkins-Diehr, Area Director for TeraGrid Science Gateways, SDSC/UCSD



TeraGrid Principal Investigators (GIG and RPs)

Ian Foster (GIG), University of Chicago/Argonne National Laboratory (UC/ANL)
Phil Andrews, University of Tennessee (UT-NICS)
Jay Boisseau, Texas Advanced Computing Center (TACC)
John Cobb, Oak Ridge National Laboratory (ORNL)
Michael Levine, Pittsburgh Supercomputing Center (PSC)
Rich Loft, National Center for Atmospheric Research (NCAR)
Charles McMahon, Louisiana Optical Network Initiative/Louisiana State University (LONI/LSU)
Mark Sheddon, San Diego Supercomputer Center (SDSC)
Carol Song, Purdue University (PU)
Rick Stevens, University of Chicago/Argonne National Laboratory (UC/ANL)
Craig Stewart, Indiana University (IU)
John Towns, National Center for Supercomputing Applications (NCSA)


TeraGrid Senior Personnel

Grid Infrastructure Group

Matt Heinzel (UC), Deputy Director of the TeraGrid GIG
Tim Cockerill (NCSA), Project Management Working Group
Kelly Gaither (TACC), Data Analysis and Visualization
Dave Hart (SDSC), User Facing Projects
Daniel S. Katz (LSU), GIG Director of Science
Scott Lathrop (UC/ANL), Education, Outreach and Training; External Relations
Elizabeth Leake (UC), External Relations
Lee Liming (UC/ANL), Software Integration and Scheduling
Amit Majumdar (SDSC), Advanced User Services
J.P. Navarro (UC/ANL), Software Integration and Scheduling
Mike Northrop (UC), GIG Project Manager
Tony Rimovsky (NCSA), Networking, Operations and Security
Sergiu Sanielevici (PSC), User Services and Support
Nancy Wilkins-Diehr (SDSC), Science Gateways



B Project Summary


C Table of Contents

A TeraGrid Extension: Bridging to XD
B Project Summary
C Table of Contents
D Project Description
  D.1 Introduction [2 pages - JohnT]
  D.2 TeraGrid Science [5 pages - DanK]
    D.2.1 [[Science cases]]
  D.3 Software Integration and Scheduling (2.5 pages - LeeL/JPN)
    D.3.1 Advanced Scheduling and Metascheduling
    D.3.2 Information Services
    D.3.3 Packaging and Maintaining CTSS Kits
    D.3.4 Operating Infrastructure Services
  D.4 Science Gateways [2.5 pages - NancyW-D]
    D.4.1 Targeted Support Activities
    D.4.2 Gateway Support Services
  D.5 User Support [3 pages - AmitM, SergiuS]
    D.5.1 Advanced User Support
    D.5.2 User Services
  D.6 User Facing Projects [2.5 pages - DaveH]
    D.6.1 Online User Presence
    D.6.2 Core Services (maybe there is a better name, but it is what we use)
  D.7 Data and Visualization [3 pages - KellyG]
  D.8 Network, Operations, and Security [4 pages - TonyR]
  D.9 Education, Outreach, and Training; and External Relations [3 pages - ScottL]
    D.9.1 Training
    D.9.2 Education and Outreach
    D.9.3 Enhancing Diversity and Broader Impacts
    D.9.4 External Relations [1.0 FTE @ UC]
    D.9.5 International Collaborations
  D.10 Organization, Management and Project Management [2.5 pages - JohnT, MattH, TimC]
    D.10.1 TeraGrid Organization and Management
    D.10.2 Advisory Groups
    D.10.3 Project Management and Budget
    D.10.4 Personnel



D Project Description

D.1 Introduction [2 pages - JohnT]

TeraGrid's three-part mission is to support the most advanced computational science in multiple domains, to empower new communities of users, and to provide resources and services that can be extended to a broader cyberinfrastructure.

The TeraGrid is an advanced, nationally distributed, open cyberinfrastructure comprised of supercomputing, storage, and visualization systems, data collections, and science gateways, connected by high-bandwidth networks and integrated by coordinated policies and operations, supported by computing and technology experts, that enables and supports leading-edge scientific discovery and promotes science and technology education.

Accomplishing this vision is crucial for the advancement of many areas of scientific discovery, for ensuring US scientific leadership, and, increasingly, for addressing critical societal issues. TeraGrid achieves its purpose and fulfills its mission through a three-pronged focus:

deep: ensure profound impact for the most experienced users, through provision of the most powerful computational resources and advanced computational expertise;

wide: enable scientific discovery by broader and more diverse communities of researchers and educators who can leverage TeraGrid's high-end resources, portals, and science gateways; and

open: facilitate simple integration with the broader cyberinfrastructure through the use of open interfaces, partnerships with other grids, and collaborations with other science research groups delivering and supporting open cyberinfrastructure facilities.

The TeraGrid's deep goal is to enable transformational scientific discovery through leadership in HPC for high-end computational research. The TeraGrid is designed to enable high-end science utilizing powerful supercomputing systems and high-end resources for the data analysis, visualization, management, storage, and transfer capabilities required by large-scale simulation and analysis. All of this requires an increasingly diverse set of leadership-class resources and services, and deep intellectual expertise.

The TeraGrid's wide goal is to increase the overall impact of TeraGrid's advanced computational resources on larger and more diverse research and education communities through user interfaces and portals, domain-specific gateways, and enhanced support that facilitate scientific discovery without requiring people to become high-performance computing experts. The complexity of using TeraGrid's high-end resources will continue to grow as systems increase in scale and evolve with new technologies. TeraGrid broadens the scientific user base of its resources via the development and support of simple and powerful interfaces, ranging from common user environments to Science Gateways and portals, through more focused outreach and collaboration with science domain research groups, and by educational and outreach efforts that will help inspire and educate the next generation of America's leading-edge scientists.

TeraGrid's open goal is twofold: to ensure the expansibility and future viability of the TeraGrid by using open standards and interfaces, and to ensure that the TeraGrid is interoperable with other, open-standards-based cyberinfrastructure facilities. While TeraGrid provides only integrated high-end resources, it must enable its high-end cyberinfrastructure to be more accessible from, and even integrated with, cyberinfrastructure of all scales. That includes not just other grids, but also campus cyberinfrastructures and even individual researcher labs and systems. The TeraGrid leads the community forward by providing an open infrastructure that enables, simplifies, and even encourages scaling out to its leadership-class resources by establishing models in which computational resources can be integrated both for current and for new modalities of science. This openness includes interfaces and APIs, but goes further to include appropriate policies, support, training, and community building.


D.2 TeraGrid Science [5 pages - DanK]

The science (both research and education) that is enabled by the TeraGrid can generally be categorized according to the TeraGrid mission: deep, wide, and open. The TeraGrid is oriented towards user-driven projects, with each project led by a PI who applies for an allocation to enable transformative scientific discovery by a project. A project can consist of a set of users identified by the PI, or a community represented by the PI.

In general, the TeraGrid's deep focus encompasses projects, usually from small, established groups of users, that make highly skilled use of TeraGrid resources, while the TeraGrid's wide focus encompasses projects from either new or established science communities that use TeraGrid resources for both research and education without requiring specific high-performance computing skills, even when their users are quite computationally proficient.

In both cases, various capabilities of the TeraGrid's open focus can be needed, such as networking (both within the TeraGrid and to other resources, such as on campuses or in other cyberinfrastructures), common grid software (to enable easy use of multiple TeraGrid and non-TeraGrid resources), the TeraGrid user portal (enabling single sign-on and common access to TeraGrid resources and information), services such as metascheduling (automated selection of specific resources), co-scheduling (use of multiple resources for a single job), reservations (use of a resource at a specific time), workflows (use of single or multiple resources for a set of jobs), and gateways (interfaces to resources that hide complex features or usage patterns, or that tie TeraGrid resources to additional datasets and capabilities). Additionally, a number of the most experienced TeraGrid users simply want low-overhead access to a single machine. Even in this category, the variety of architectures in the TeraGrid enables applications that would not run well on simple clusters, including those that require the lowest latency and microkernel operating systems to scale well, and those that require large amounts of shared memory. While the Track 2 systems (Ranger and Kraken) will continue to be supported even if this proposal is not funded, the shared-memory systems (Pople and Cobalt) will not continue to be supplied to the national user community without this proposed work. These systems are particularly useful to a number of newer users in areas such as game theory, web analytics, and machine learning, as well as being a key part of the workflow in a number of more established applications, as described below.

D.2.1 Geosciences - SCEC, PI Tom Jordan, USC

The Southern California Earthquake Center (SCEC) is an interdisciplinary research group that includes over 600 geoscientists, computational scientists, and computer scientists from about 60 institutions, including the United States Geological Survey. Its goal is to develop an understanding of earthquakes and to mitigate risks through that knowledge. SCEC is a perfect example where the distributed resources of TeraGrid are used in an integrated manner to achieve transformative geophysical science results. These results directly impact everyday life by contributing to new building codes (used for the construction of buildings, hospitals, and nuclear reactors), emergency planning, etc., and could potentially save billions of dollars in the long run. SCEC simulations consist of highly scalable runs, mid-range core-count runs, and embarrassingly parallel small core-count runs, and they require high-bandwidth data transfer and large storage for post-processing and data sharing.


For high core-count runs, SCEC researchers use highly scalable codes (AWM-Olsen, Hercules, AWP-Graves) on many tens of thousands of processors of the largest TeraGrid systems (TACC Ranger and NICS Kraken) to improve the resolution of dynamic rupture simulations by an order of magnitude and to study the impact of geophysical parameters. These highly scalable codes are also used to run high-frequency (1.0 Hz currently, and higher in the future) wave propagation simulations of earthquakes on systems at SDSC, TACC, and PSC. Using different codes on different machines and observing the match between the ground motions projected by the simulations is needed to validate the results. Systems are also chosen based on the memory requirements for mesh and source partitioning, which require large-memory machines; PSC Pople and the recently decommissioned SDSC DataStar computer have been used for this.

For mid-range core-count runs, SCEC researchers are carrying out full 3D tomography (called Tera3D) data-intensive runs on NCSA Abe and the SDSC IA-64 cluster, using a few hundred to a few thousand cores. SCEC researchers are also studying "inverse" problems that require running many forward simulations while perturbing the ground structure model and comparing against recorded surface data. As this inverse problem requires hundreds of forward runs, it is necessary to recruit multiple platforms to distribute this work.

Another important aspect of SCEC research is the SCEC CyberShake project, whose scientific goal is to use 3D waveform modeling (Tera3D) to calculate probabilistic seismic hazard analysis (PSHA) curves for sites in Southern California. A PSHA map provides estimates of the probability that the ground motion at a site will exceed some intensity measure over a given time period. For each point of interest, the CyberShake platform includes two large-scale MPI calculations and approximately 840,000 data-intensive, pleasingly parallel post-processing jobs. The required complexity and scale of these calculations have impeded production of detailed PSHA maps; however, through the integration of hardware, software, and people in a gateway-like framework, these techniques can now be used to produce large numbers of research results. Grid-based workflow tools are used to manage these hundreds of thousands of jobs on multiple TeraGrid resources, including the NCSA and SDSC IA-64 clusters. Over 1 million CPU hours were consumed in 2008 through this usage model.

The high core-count simulations can produce 100-200 TB of output data. Much of this output is registered in the digital library on file systems at NCSA and on SDSC's GPFS-WAN. In total, SCEC requires close to half a petabyte of archival storage every year. Efficient data transfer and access to large files for the Tera3D project are of high priority. To ensure the datasets are safely archived, redundant copies are kept at multiple locations. The collection of Tera3D simulations includes more than a hundred million files, with each simulation organized as a separate sub-collection in the iRODS data grid.

The distributed cyberinfrastructure of TeraGrid, with a wide variety of HPC machines (with different numbers of cores, memory per core, and varying interconnect performance), a high-bandwidth network, large parallel file systems, wide-area file systems, and large archival storage, is needed to allow SCEC researchers to carry out scientific research in an integrated manner.

D.2.2 Social Sciences (SIDGrid) - PI Rick Stevens, University of Chicago; TeraGrid Allocation PI Sarah Kenny, University of Chicago

SIDGrid is one of a small group of social science teams currently using the TeraGrid and the only science gateway in this field. SIDGrid provides unique capabilities for researchers, who make heavy use of "multimodal" data: streaming data that change over time. For example, a human subject views a video while a researcher collects heart rate and eye movement data. Data are collected many times per second and synchronized for analysis, resulting in large datasets.

Sophisticated analysis tools are required to study these datasets, which can involve multiple datasets collected at different time scales. Providing these analysis capabilities through a gateway has many advantages. Individual laboratories do not need to recreate the same sophisticated analysis tools. Geographically distant researchers can collaborate on the analysis of the same data sets. Social scientists from any institution can be involved in analysis, increasing the opportunity for science impact by providing access to the highest quality resources to all social scientists.

SIDGrid uses TeraGrid resources for computationally intensive tasks such as media transcoding (decoding and encoding between compression formats), pitch analysis of audio tracks, and functional Magnetic Resonance Imaging (fMRI) image analysis. These often result in large numbers of single-node jobs. Current platforms in use include TeraGrid roaming platforms and TACC's Spur and Ranger systems. Workflow tools such as Swift have been very useful in job management.

Active users of the SIDGrid system include a human neuroscience group and linguistic research groups from the University of Chicago and the University of Nottingham, UK. TeraGrid is providing support to make use of the resources more effectively. Building on experiences with OLSGW, the same team will address similar issues for SIDGrid. A new application framework has been developed to enable users to easily deploy new social science applications in the SIDGrid portal.

D.2.3 Astronomy - PI Mike Norman, UCSD, and Tom Quinn, U. Washington

ENZO is a multi-purpose code for computational astrophysics, developed by Norman's group at UCSD. It uses adaptive mesh refinement to achieve high temporal and spatial resolution, includes a particle-based method for modeling dark matter in cosmology simulations, and provides state-of-the-art PPML algorithms for MHD. A version that couples radiation diffusion and chemical kinetics is in development.

ENZO consists of several components that are used to create initial conditions, evolve a simulation in time, and then analyze the results. Each component has quite different computational requirements, and the requirements of the full set of components cannot be met at any single TeraGrid site. For example, the current initial conditions generator for cosmology runs is an OpenMP-parallel code that requires a large shared-memory system; NCSA Cobalt is the primary platform that runs this code at production scale. The initial conditions data can be very large; e.g., the initial conditions for a 2048^3 cosmology run contain approximately 1 TB of data.
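
A back-of-envelope check makes this scale plausible (the field counts here are our assumption, not figures from the ENZO team): a 2048^3 mesh has about 8.6 x 10^9 cells, so four double-precision baryon fields alone occupy roughly 8.6 x 10^9 x 4 x 8 bytes, or about 0.27 TB, and a comparable number of dark-matter particles with double-precision positions and velocities adds roughly 8.6 x 10^9 x 6 x 8 bytes, or about 0.41 TB, already of order 1 TB before any additional fields are counted.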

Production runs are done mainly on NICS Kraken and TACC Ranger, so the initial conditions generated on Cobalt must be transferred to these sites over the TeraGrid network using GridFTP. Similarly, the output from an ENZO simulation must generally be transferred to a site with suitable resources for analysis and visualization, both of which typically require large shared-memory systems similar to NCSA Cobalt, PSC Pople, or, until recently, the P690 nodes of SDSC DataStar. Furthermore, some sites are better equipped than others to provide long-term archival storage of a complete ENZO simulation (of the order of 100 TB) for a period of several months to several years. Thus, almost every large-scale ENZO run depends on multiple TeraGrid resources and on the high-speed network links between the TeraGrid sites.

Quinn, at the University of Washington, is using the N-body cosmology code GASOLINE to analyze N-body simulations of structure formation in the universe. This code utilizes the TeraGrid infrastructure in a fashion similar to the ENZO code. Generation of the initial conditions, done using a serial code, requires several hundred GB of RAM and is optimally done on the NCSA Cobalt system. Since the highly scalable N-body simulations are performed on PSC BigBen, the initial condition data have to be transferred over the high-bandwidth TeraGrid network. The total output, especially when the code is run on Ranger and Kraken, can reach a few petabytes and approximately one thousand files. The researchers use visualization software that allows interactive steering, and they are exploring the TeraGrid global file systems to ease data staging for post-processing and visualization.


D.2.4 Biochemistry/Molecular Dynamics - Multiple PIs (Adrian Roitberg, U. Florida; Tom Cheatham, U. Utah; Greg Voth, U. Utah; Klaus Schulten, UIUC; Carlos Simmerling, Stony Brook; etc.)

Many Molecular Dynamics users use the same codes (such as AMBER, NAMD, CHARMM, LAMMPS, etc.) for their research, although they are looking at different research problems, such as drug discovery, advanced materials research, and advanced enzymatic catalyst design impacting areas such as biofuel research. The broad variation in the types of calculations needed to complete various Molecular Dynamics workflows (including both quantum and classical calculations), along with large-scale storage and data transfer requirements, defines a requirement for a diverse set of resources coupled with high-bandwidth networking. The TeraGrid therefore offers an ideal resource for all researchers who conduct Molecular Dynamics simulations.

Quantum calculations, which are an integral part of the parameterization of force fields and are often used in the Molecular Dynamics runs themselves in the form of hybrid QM/MM calculations, require large shared-memory machines like NCSA Cobalt or PSC Pople. The latest generation of machines, which feature large numbers of processors interconnected by high-bandwidth networks, do not lend themselves to the extremely fine-grained parallelization needed for the rapid solution of the self-consistent field equations necessary for QM/MM MD simulations. (It should be noted that a number of MD research groups are working on being able to do advanced quantum calculations over distributed-core machines such as Ranger and Kraken. The availability of those types of machines as testbeds, and in future production, is incredibly useful for these MD researchers.) Classical MD runs using the AMBER and NAMD packages (as well as other commonly available MD packages) use the distributed-memory architectures present in Kraken, Abe, and Ranger very efficiently for running long-time-scale MD simulations. These machines allow simulations that were not possible as recently as two years ago, and they are having an enormous impact on the field of MD. Some MD researchers use QM/MM techniques, and these researchers benefit from the existence of machines with nodes that have different amounts of memory per node, as the large-memory nodes are used for the quantum calculations and the other nodes for the classical part of the job.

The reliability and predictability of biomolecular simulation is increasing at a fast pace, fueled by access to the NSF large-scale computational resources across the TeraGrid. However, researchers are now entering a realm where they are becoming deluged by the data and its subsequent analysis. More and more, large ensembles of simulations, often loosely coupled, are run together to provide better statistics, sampling, and efficient use of large-scale parallel resources. Managing these simulations, performing post-processing and visualization, and ultimately steering the simulations in real time currently have to be done on local machines. The TeraGrid Advanced User Support program is working with the MD researchers to address some of these limitations. Although most researchers currently bring data back to their local sites to do analysis, this is quickly becoming impractical and is limiting scientific discovery. Access to large persistent analysis space linked to the various computational resources on the TeraGrid by the high-bandwidth TeraGrid network is therefore essential to enabling groundbreaking new discoveries in this field.

D.2.5 CFD - PI Krishnan Mahesh, U. Minnesota

Access to HPC resources with different system parameters is important for many CFD users. Here we describe the use case of a particular CFD user to show how the distributed infrastructure of TeraGrid is needed and utilized by this user, who is representative of many other CFD projects and users. Mahesh uses an unstructured-grid computational fluid dynamics code for modeling the very complex geometries of real-life engineering problems. For example, his code has been used to conduct large eddy simulations of incompressible mixing in the exceedingly complex geometry of gas turbines. The unstructured-grid approach has also been extended to compressible flow solvers and used for studying jets in supersonic crossflow.


This code has been run at large scale, using up to 2,048 cores and 50 million control volumes, on the SDSC Blue Gene, DataStar, and IA-64 cluster and on NCSA Mercury. The code shows very good weak scaling, and the communication pattern is largely localized to nearest neighbors. Currently the code runs on Ranger and Kraken, and the PI is planning simulations on these machines at larger scales than were possible on DataStar and the TeraGrid clusters. These larger simulations will provide the capability to reach resolutions at Reynolds numbers observed only experimentally today. This will allow him to solve engineering turbulence problems such as the flow around marine propellers (simulating crashback, where the propeller is suddenly spun in the reverse of its normal direction). A critical component needed before the many-thousand-core-count simulations can be done is the grid generation and initial condition generation required by the main runs. This part of the code is serial and requires many hundreds of GB of shared memory for large cases, so it can only be done on large shared-memory machines such as Cobalt or Pople. The initial data then need to be transferred to sites such as TACC and NICS for the large-scale simulations. After the simulation, further data access is needed to do in-situ post-processing and visualization, or the output data are transferred back to the local site, at the University of Minnesota, for post-processing and visualization. The high-bandwidth network of the TeraGrid is essential for both of these scenarios. Thus, even for this seemingly traditional CFD user, the distributed infrastructure of TeraGrid, with highly scalable machines, large shared-memory machines, and a fast network, is essential to carrying out new engineering simulations.


D.2.6 Structural Engineering - Multiple NEES PIs

The Network for Earthquake Engineering Simulation (NEES) project is an NSF-funded MRE project that seeks to lessen the impact of earthquake- and tsunami-related disasters by providing revolutionary capabilities for earthquake engineering research. A state-of-the-art network links world-class experimental facilities around the country, making it possible for researchers to collaborate remotely on experiments, computational modeling, data analysis, and education. NEES currently has about 75 users spread across about 15 universities. These users use TeraGrid HPC and data resources for various kinds of structural engineering simulations, using both commercial codes and research codes based on algorithms developed by academic researchers. Some of these simulations, especially those using commercial FEM codes such as Abaqus, Ansys, Fluent, and LS-Dyna, require moderately large shared-memory nodes, such as the large-memory nodes of Abe and Mercury, but scale to only a few tens of processors using MPI. Large memory is needed so that the whole mesh structure can be read in to a single node; this is necessary due to the basic FEM algorithm applied for some simulation problems.


Many of these codes have OpenMP parallelization in addition to MPI parallelization, and users mainly utilize shared-memory nodes with OpenMP for pre- and post-processing. On the other hand, some of the academic codes, such as the OpenSees simulation package tuned for specific material behavior, have utilized many thousands of processors on machines like DataStar, Kraken, and Ranger, scaling well at these high core counts. Due to the geographically distributed locations of NEES researchers and experimental facilities, high-bandwidth data transfer and data access are vital requirements.

NEES researchers also perform "hybrid tests" in which multiple geographically distributed structural engineering experimental facilities (e.g., shake tables) perform structural engineering experiments simultaneously in conjunction with simulations running on TeraGrid resources. Some complex pseudo real-life engineering test cases can only be captured by having multiple simultaneous experiments coupled with complementary simulations, as they are too complex to perform with either experimental facilities or simulations alone. These "hybrid tests" require close coupling and data transfer in real time between the experimental facilities and TeraGrid compute and data resources using the fast network. NEES as a whole is dependent on the variety of HPC resources of TeraGrid, the high-bandwidth network, and data access and sharing tools.


D.2.7 Biosciences - PI George Karniadakis, Brown University

High-resolution, large-scale simulations of blood flow in the human arterial tree require solution of flow equations with billions of degrees of freedom. To perform such computationally demanding simulations, tens or even hundreds of thousands of computer processors must be employed. Use of a network of distributed computers (the TeraGrid) presents an opportunity to carry out these simulations efficiently; however, new computational methods must be developed.

The Human Arterial Tree project has developed a new scalable approach for simulating large multiscale computational mechanics problems on a network of distributed computers, or grid. The method has been successfully employed in cross-site simulations connecting SDSC, TACC, PSC, UC/ANL, and NCSA.

The project considers 3D simulation of blood flow in the intracranial arterial tree using NEKTAR, the spectral/hp element solver developed at Brown University. It employs a multi-layer hierarchical approach whereby the problem is solved on two layers. On the inner layer, solutions of large, tightly coupled problems are performed simultaneously on different supercomputers, while on the outer layer, the solution of the loosely coupled problem is performed across distributed supercomputers, involving considerable inter-machine communication. The heterogeneous communication topology (i.e., both intra- and inter-machine communication) was handled initially by MPICH-G2 and later by the recently developed MPIg libraries. MPIg's multithreaded architecture provides applications with an opportunity to overlap computation and inter-site communication on multicore systems. Cross-site computations performed on the TeraGrid's clusters demonstrate the benefits of MPIg over MPICH-G2. The multi-layer communication interface implemented in NEKTAR permits efficient communication between multiple groups of processors. The developed methodology is suitable for the solution of multi-scale and multi-physics problems both on distributed systems and on modern petaflop supercomputers.
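
The benefit of overlapping computation with inter-site communication can be illustrated with a standard non-blocking MPI pattern. The sketch below is ours, not NEKTAR or MPIg source; the kernel names and buffer layout are hypothetical placeholders, and a real multi-layer solver is considerably more involved.

    /* Illustrative sketch only (not NEKTAR/MPIg code): overlap the slow
       wide-area boundary exchange with tightly coupled local work using
       non-blocking MPI.  compute_interior() and compute_boundary() are
       hypothetical stand-ins for the real solver kernels. */
    #include <mpi.h>

    static void compute_interior(void) { /* local solver step (stub) */ }
    static void compute_boundary(const double *halo, int n) { (void)halo; (void)n; }

    void timestep(double *send_buf, double *recv_buf, int count,
                  int remote_rank, MPI_Comm intersite_comm)
    {
        MPI_Request reqs[2];

        /* Post the inter-site exchange first ... */
        MPI_Irecv(recv_buf, count, MPI_DOUBLE, remote_rank, 0,
                  intersite_comm, &reqs[0]);
        MPI_Isend(send_buf, count, MPI_DOUBLE, remote_rank, 0,
                  intersite_comm, &reqs[1]);

        /* ... then do local work while the wide-area messages are in flight. */
        compute_interior();

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        compute_boundary(recv_buf, count);   /* uses the freshly received halo */
    }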

D.2.8 Neutron Science - PI John Cobb, ORNL

The Neutron Science TeraGrid Gateway (NSTG) project is an exemplar for the use of cyberinfrastructure for simulation and data analysis coupled to an experiment. The unique contributions of NSTG are the connection of national user facility instrument data sources to the integrated cyberinfrastructure of the TeraGrid and the development of a neutron science gateway that allows neutron scientists to use TeraGrid resources to analyze their data, including comparison of experiment with simulation. The NSTG is working in close collaboration with the Spallation Neutron Source (SNS) at Oak Ridge as its principal facility partner. The SNS is a next-generation neutron source, which has completed construction at a cost of $1.4 billion and is ramping up operations. The SNS will provide an order of magnitude greater flux than any other neutron scattering facility in the world and will be available to all of the nation's scientists, independent of funding source, on a reviewed basis. With this new capability, the neutron science community is facing orders of magnitude larger data sets and is at a critical point for data analysis and simulation. The community recognizes the need for new ways to manage and analyze data to optimize both beam time and scientific output. The TeraGrid is providing new capabilities in the gateway for simulations using McStas and for data analysis through the development of a fitting service. Both run on distributed TeraGrid resources, at ORNL, TACC, and NCSA, to improve turnaround. NSTG is also exploring archiving experimental data on the TeraGrid. As part of the SNS partnership, the NSTG provides gateway support, cyberinfrastructure outreach, community development, and user support for the neutron science community, including not only SNS staff and users but extending to all five neutron scattering centers in North America and several dozen worldwide.

D.2.9 Chemistry (GridChem) - Project PI John Connolly, University of Kentucky; TeraGrid Allocation PI Sudhakar Pamidighantam, NCSA

Computational chemistry forms the foundation not only of chemistry but is also required in materials science and biology. Understanding molecular structure and function is beneficial in the design of materials for electronics, biotechnology, and medical devices, and also in the design of pharmaceuticals.

GridChem, an NSF Middleware Initiative (NMI) project, provides a reliable infrastructure and capabilities beyond the command line for computational chemists. GridChem, one of the most heavily used TeraGrid science gateways in 2008, requested and is receiving advanced support resources from the TeraGrid. This advanced support work will address a number of issues, many of which will benefit all gateways. These issues include common user environments for domain software, standardized licensing, application performance characteristics, gateway incorporation of additional data handling tools and data resources, fault-tolerant workflows, scheduling policies for community users, and remote visualization. This collaboration with TeraGrid staff is ongoing in 2009.

D.2.10 Astrophysics - PI Erik Schnetter, LSU; Christian D. Ott, Caltech; Denis Pollney and Luciano Rezzolla, AEI

Cactus <http://www.cactuscode.org> is an HPC software framework enabling parallel computation across different architectures and collaborative code development between different groups. Cactus originated in the academic research community, where it was developed and used over many years by a large international collaboration of physicists and computational scientists. Cactus is now mainly developed at LSU, with major contributions from the AEI in Germany, and is predominantly used in computational relativistic astrophysics, where it is employed by several groups in the US and abroad.

An application that is based on Cactus consists of a set of individual modules ("thorns") that encapsulate particular physical, computational, or infrastructure algorithms. A special "driver" thorn provides parallelism, load balancing, memory management, and efficient I/O. One such driver is Carpet <http://www.carpetcode.org>, which supports both adaptive mesh refinement (AMR) with subcycling in time and multi-block methods, offering a hybrid parallelization combining MPI and OpenMP. An Einstein Toolkit provides a common basic infrastructure for relativistic astrophysics calculations. Cactus, Carpet, the Einstein Toolkit, and many other thorns are available as open source, while most cutting-edge physics thorns are developed privately by individual research groups.

Current significant users of Cactus outside LSU include AEI (Germany), Caltech, GA Tech, KISTI (South Korea), NASA GSFC, RIT, Southampton (UK), Tübingen (Germany), UI, UMD, Tokyo (Japan), and WashU. Ongoing development is funded, among others, via collaborative grants from NASA (ParCa, with partners LSU, GSFC, and the company Decisive Analytics Corporation) and NSF (XiRel, with partners LSU, GA Tech, and RIT).

The LSU-AEI-Caltech numerical relativity collaboration uses Cactus-based applications to study binary systems of black holes and neutron stars as well as stellar collapse scenarios. Numerical simulations are the only practical way to study these systems, which requires modeling the Einstein equations, relativistic hydrodynamics, magnetic fields, nuclear microphysics, and the effects of neutrino radiation. This results in a complex, coupled system of non-linear equations describing effects that span a wide range of length and time scales, which are addressed with high-order discretization methods, adaptive mesh refinement with up to 12 levels, and multi-block methods. The resulting applications are highly portable and have been shown to scale up to 12k cores, with currently up to 2k cores used in production runs.

Production runs are mainly performed on Queen Bee at LONI, Ranger at TACC, and Damiana at the AEI, and it is expected that Kraken will soon also be used for production. These applications prefer to have 2 GByte of memory per core available due to the parallelization overhead of the higher-order methods, but they can run with less memory if OpenMP is used, though combining OpenMP and MPI does not always increase performance. Initial configurations are typically either calculated at the beginning of the simulation or imported from one-dimensional data. Runs may involve a large number of time steps, leading to wall-clock times of 20 days or more for a high-resolution run.
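
As a rough illustration of why the hybrid mode reduces memory pressure, the skeleton below (our own sketch, not Cactus or Carpet source) places one MPI rank per node and lets OpenMP threads share that rank's block of the mesh, so ghost-zone and buffer overhead is paid per rank rather than per core.

    /* Illustrative hybrid MPI+OpenMP skeleton (not Cactus/Carpet code). */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel
        {
            /* Each thread would update its share of this rank's grid block here. */
            #pragma omp master
            printf("rank %d of %d using %d threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }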

D.2.11 Biosciences (Robetta) - PI David Baker, University of Washington

Protein structure prediction is one of the more important components of bioinformatics. The Rosetta code, from the David Baker laboratory, has performed very well at CASP (Critical Assessment of Techniques for Protein Structure Prediction) competitions and is available for use by any academic scientist via the Robetta server, a TeraGrid science gateway. Robetta developers were able to use TeraGrid's gateway infrastructure, including community accounts and Globus, to allow researchers to run Rosetta on TeraGrid resources through the gateway. This very successful group did not need any additional TeraGrid assistance to build the Robetta gateway; it was done completely by using the tools TeraGrid provides to all potential gateway developers. Google Scholar reports 601 references to the Robetta gateway, including many PubMed publications. Robetta has made extensive use of a TeraGrid roaming allocation and will be investigating additional platforms such as Purdue's Condor pool and the NCSA/LONI Abe-QueenBee systems.

D.2.12 GIScience - PI Shaowen Wang, University of Illinois

The GIScience gateway, a geographic information systems (GIS) gateway, has over 60 regular users and is used by undergraduates in coursework at UIUC. GIS is becoming an increasingly important component of a wide variety of fields. The GIScience team has worked with researchers in fields as distinct as ecological and environmental research, biomass-based energy, linguistics (linguist.org), coupled natural and human systems and digital watershed systems, hydrology, and epidemiology. The team has allocations on resources in TeraGrid ranging from TACC's Ranger system to NCSA's shared-memory Cobalt system to Purdue's Condor pool and Indiana's BigRed system. Most usage to date has been on the NCSA/LONI Abe-QueenBee systems. The GIScience gateway may also lead to collaborations with the Chinese Academy of Sciences through the work of the PI.

D.2.13 Computer Science: Solving Large Sequential Two-person Zero-sum Games of Imperfect Information - PI Tuomas Sandholm, Carnegie Mellon University

Professor Sandholm's work in game theory is internationally recognized. While many games can be formulated mathematically, the formulations for those that best represent the challenges of real-life human decision making (in national defense, economics, etc.) are huge. For example, two-player poker has a game tree of about 10^18 nodes. In the words of Sandholm's Ph.D. student Andrew Gilpin, "To solve that requires massive computational resources. Our research is on scaling up game-theory solution techniques to those large games, and new algorithmic design."


The most computationally intensive portion of Sandholm and Gilpin's algorithm is a matrix-vector product, where the matrix is the payoff matrix and the vector is a strategy for one of the players. This operation accounts for more than 99% of the computation and is a bottleneck to applying game theory to many problems of practical importance. To drastically increase the size of problems the algorithm can handle, Gilpin and Sandholm devised an approach that exploits massively parallel systems with a non-uniform memory-access architecture, such as Pople, PSC's SGI Altix. By making all data addressable from a single process, shared memory simplifies a central, non-parallelizable operation performed in conjunction with the matrix-vector product. Sandholm and Gilpin are doing experiments to learn how the shared-memory code performs and to identify areas for further algorithmic improvement.
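
The shared-memory advantage is easiest to see in the dominant kernel itself. The fragment below is a minimal, generic sketch of a row-partitioned payoff-matrix-vector product under OpenMP; it is not the Sandholm/Gilpin code, and the matrix layout and names are our assumptions.

    /* Minimal sketch (not the actual CMU code): y = A*x for an m-by-n payoff
       matrix A stored row-major in shared memory, with x a strategy vector. */
    #include <omp.h>

    void payoff_matvec(const double *A, const double *x, double *y,
                       long m, long n)
    {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < m; i++) {
            double sum = 0.0;
            for (long j = 0; j < n; j++)
                sum += A[i * n + j] * x[j];
            y[i] = sum;   /* threads write disjoint rows, so no synchronization */
        }
    }

Because every thread reads the same A and x directly from shared memory, no blocks of the payoff matrix need to be communicated between processes, which is the property a large NUMA shared-memory system such as Pople exploits.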

D.2.14 Nanoscale Electronic Structures/nanoHUB - PI Gerhard Klimeck, Purdue University

Gerhard Klimeck's lab is tackling the challenge of designing microprocessors and other devices at a time when their components are dipping into the nanoscale, a billionth of a meter. The new generation of nano-electronic devices requires a quantum-mechanical description to capture the properties of devices built on an atomic scale. This is required to study quantum dots (spaces where electrons are corralled into acting like atoms, creating in effect a tunable atom for optical applications), resonant tunneling diodes (useful in very high-speed circuitry), and tiny nanowires. The simulations in this project look two or three generations down the line as components continue to shrink, projecting their physical properties and performance characteristics under a variety of conditions before they are fabricated. The codes are also used to model quantum computing.


Klimeck's team received an NSF Petascale Applications award for the NEMO3-D and OMEN software development projects, aimed at efficiently using the petascale systems that are being made available by the TeraGrid. They have already employed the software in multimillion-atom simulations matching experimental results for nanoscale semiconductors, and have run a prototype of the new OMEN code on 32,768 cores of TACC's Ranger system. They also use TeraGrid resources at NCSA, PSC, IU, ORNL, and Purdue. Their simulations involve millions to billions of interacting electrons and thus require highly sophisticated and optimized software running on the TeraGrid's most powerful systems. Different code and machine characteristics may be best suited to different specific research problems, but it is important for the team to plan and execute their virtual experiments on all these resources in a coordinated manner, and to easily transfer data between systems.

This project aims not only at direct research but also at creating modeling and simulation tools that other researchers, educators, and students can use through nanoHUB, a TeraGrid Science Gateway designed to make doing research on the TeraGrid easier. The PI likens the situation to making computation as easy as making phone calls or driving cars, without being a telephone technician or an auto mechanic. Overall, nanoHUB.org is hosting more than 90 simulation tools, with more than 6,200 users who ran more than 300,000 simulations in 2008. The hosted codes range in computational intensity from very lightweight to extremely intensive, such as NEMO 3-D and OMEN. The nanoHUB.org site has more than 68,000 users in 172 countries, with a system uptime of more than 99.4 percent. More than 44 classes used the resource for teaching. According to Klimeck, it has become an infrastructure people rely on for day-to-day operations.

nanoHUB plans on being among the early testers for the metascheduling capabilities currently being developed by the TeraGrid, since interactivity and reliability are high priorities for nanoHUB users. The Purdue team is also looking at additional communities that might benefit from the use of HUB technology and the TeraGrid. The Cancer Care Engineering HUB is one such community.

D.2.15 Atmospheric Sciences (LEAD) - PI Kelvin Droegemeier, University of Oklahoma

In preparation for the spring 2008 Weather Challenge, involving 67 universities, the LEAD team and TeraGrid began a very intensive and extended "gateway-debug" activity involving Globus developers, TeraGrid resource provider (RP) system administrators, and the TeraGrid GIG software integration and gateway teams. Extensive testing and evaluation of GRAM, GridFTP, and RFT were conducted on an early CTSS V4 testbed especially tuned for stability. The massive debugging efforts laid the foundation for improvements in the reliability and scalability of TeraGrid's grid middleware for all gateways. A comprehensive analysis of job submission scenarios simulating multiple gateways will be used to conduct a scalability and reliability analysis of WS GRAM. The LEAD team also participated in the NOAA Hazardous Weather Testbed Spring 2008 severe weather forecasts. High-resolution on-demand and urgent computing weather forecasts will enable scientists to study complex weather phenomena in near real time.

A pilot program is underway with Campus Weather Service (CWS) groups from atmospheric science departments at universities across the country. Millersville University and University of Oklahoma CWS users have been predicting local weather in 3 shifts per day with 5 km, 4 km, and 2 km forecast resolutions, computing on Big Red and archiving on the IU Data Capacitor. Development of reusable LEAD tools continues. The team is supporting the OGCE-released components: Application Factory, Registry Services, and Workflow Tools. TeraGrid supporters have generalized, packaged, and tested the notification system and personal metadata catalog to prepare for an OGCE release to be used by the gateway community and will provide workflow support to integrate with the Apache ODE workflow enactment engine.


D.3 Software Integration and Scheduling (2.5 pages - LeeL/JPN)



[[Introduce area]]

[[Why is this important? How does it support the science case?]]


D.3.1 Advanced Scheduling and Metascheduling

The purpose of this work is to maintain TeraGrid's existing metascheduling and co-scheduling capabilities in support of TeraGrid users who require cross-site scheduling capabilities for their work.

What activities continue into the Extension? What hardware is continuing into the Extension? Is there a large effort here for the interactive computing effort?

[jtowns: claim we will deploy on x86 architectures and further pursue OSG integration, might be something here]


D.3.2 Information Services

Should the only effort here be to assist new resources with implementing the TG Information Services? New resources in the Extension will include T2D and XD-Hardware.


D.3.3 Packaging and Maintaining CTSS Kits

Argonne National Laboratory, NCSA, and the University of Chicago will provide the TeraGrid Software Integration Packaging effort. The "GIG-Pack" team is collectively responsible for generating:

-- rebuilds of software components on TeraGrid resources to address security vulnerabilities and functionality issues;

-- new builds of software components across all TeraGrid resources to implement new CTSS kits; and

-- new builds of software components to allow their deployment and use on new TeraGrid resources.

Argonne National Laboratory, NCSA, TACC, and the University of Chicago will provide the TeraGrid Software Integration CTSS maintenance effort. This team will respond to help desk tickets concerning existing CTSS capability kits, debug software issues (including but not limited to defects), and work with software providers to resolve software defects.


D.3.4 Operating Infrastructure Services

Argonne National Laboratory, the University of Chicago, TACC, and NCSA will provide the TeraGrid Software Integration Operating Infrastructure Services effort. This work covers operating the TeraGrid-wide component of TeraGrid's Information Services (a central index service and WebMDS service at the University of Chicago and a set of redundant services at a commercial hosting service), operating the TeraGrid-wide component of TeraGrid's Build and Test Service (a Metronome service at the University of Chicago), and operating centralized scheduling services for TeraGrid.



D.4 Science Gateways [2.5 pages - NancyW-D]


[[Introduce area]]

[[Why is this important? How does it support the science case?]]


D.4.1 Targeted Support Activities

NancyW-D needs to provide text for the proposal describing high-priority items likely to be funded and refer to the process by which things will be decided upon (annual planning process).

The gateway targeted support program provides assistance to gateways wishing to integrate TeraGrid resources. The request process for support will be clear, and it will be clear to staff members what they are working on and when. Lessons learned will be included in general gateway documentation and case studies. Outreach will be conducted to make sure that underrepresented communities are aware of the targeted support program.


D.4.2 Gateway Support Services

Provide helpdesk support for production science gateways by answering user questions, routing user requests to appropriate gateway contacts, and tracking user responses. Provide input for the knowledge base to improve gateway helpdesk services.

Continuation of the Yr5 effort. Yr5 developed a standard implementation for community accounts, with a single implementation deployed as an exemplar. During the Extension there is a need to support additional gateways with deployment of the standard implementation.

There are continuing needs for gateway documentation and tutorials.

Yr4/5 developed the Code Discovery implementation, a means for determining available software at a SGW analogous to command-line tools on the TG. During the Extension there is a need to support additional gateways with deployment of this implementation.



D.5 User Support [3 pages - AmitM, SergiuS]


[[Introduce area]]

[[Why is this important? How does it support the science case?]]


D.5.1 Advanced User Support

As an area in which the TeraGrid can have major impact, the funding levels for this should be retained, but a much more aggressive strategy should be developed to proactively pursue users, and we should be more willing to provide a higher level of support and service. This means a willingness to do more "for" users as opposed to working "with" them on some of these activities. This is hard to capture in text, but it can be explained.


The AUS area will continue to work, in a collaborative fashion, with the User Support (US) area, the Science Gateways (SGW) area, the EOT/Broadening Participation (EOT) area, and the User Facing Projects and Core Services (UFC) area to provide comprehensive user support for all TeraGrid users. Following the established transparent process, AUS activities will focus on identifying requests and potential opportunities for providing AUS to users, prioritizing those requests and opportunities, and matching appropriate AUS staff to them. In the Extension, the total FTE count for AUS includes FTEs provided by the GIG for AUS, as well as FTEs provided by carry-forward funding from some of the RPs.



D.5.1.1 Advanced Support TG Applications

Under this subarea, one or more TeraGrid AUS staff members will provide targeted advanced support to users to enhance the effectiveness and productivity of users' applications utilizing TeraGrid resources. This support will include porting applications to new resources, implementing algorithmic enhancements, implementing parallel programming methods, incorporating mathematical libraries, improving the scalability of codes to higher processor counts, optimizing codes to efficiently utilize specific resources, enhancing scientific workflows, tackling visualization and data analysis tasks, and implementing innovative uses of resources. It should be noted that providing advanced support for the scientific applications supported by the Science Gateways also falls under AUS.



D.5.1.2 Advanced Support Projects (ASP)

In addition to ASTA support, the complex and leading-edge nature of the TeraGrid infrastructure necessitates two more categories of advanced projects that can benefit a large number of TeraGrid users. The first category comprises advanced projects, carried out by AUS staff at RP sites, that only AUS staff with a higher level of expertise and experience can perform. These are necessary tasks that fall within the AUS area and benefit a large number of TeraGrid users; in some cases they require an ongoing, maintenance mode of operation. The second category includes proactive projects that AUS staff will undertake, in consultation with the TeraGrid user community; these have the potential to benefit a large number of domain-science users or users of specific algorithms, methodologies, or mathematical libraries. In the following section we describe these two categories of advanced support projects.



D.5.1.3 Advanced Support EOT

Under this subarea, in coordination with the TeraGrid EOT area, AUS staff will provide outreach to the TeraGrid user community about the availability of, and the process for requesting, ASTA and ASP. This outreach will be done by (1) posting regular TeraGrid news about AUS availability before every quarterly TRAC deadline, (2) contacting NSF program directors who fund computational research projects and making them aware of the availability of AUS, (3) having TeraGrid staff and leaders advertise AUS availability at appropriate conferences and workshops, and (4) planning, organizing, and attending TG09 and other workshops.



D.5.2 User Services

This area coordinates frontline user support through the User Services Working Group, which is the general forum for addressing cross-site user issues and sharing best practices and technical information among RP support staff. It also coordinates frontline user support with the other GIG areas, including participation in the User Champion, Campus Champion, and Pathways efforts, and participates in the User Interaction Council to coordinate all aspects of user support, engagement, and information.


D.5.2.1 User Engagement

Coordinate the TeraGrid process for developing, administering, and reporting the TeraGrid user survey. Extract, report, and act upon continuous feedback via the User Champions program, which is TeraGrid's primary tool for proactive engagement with Research-allocation-level user groups, and via the Campus Champions program, which is aimed at Startup and Education grants and at actively fostering diversity in terms of both fields of science and demographics. Organize alpha and beta testing of new CTSS capabilities.



D.5.2.2 Frontline User Support

Share and maintain best practices for ticket resolution across all RPs. Focus on providing users with a substantive clarification of the nature of the problem and the way forward, quickly and accurately, and on the coordinated resolution of complex problems spanning RPs and systems.

Consulting support

HelpDesk and Ticketing system



D.6 User Facing Projects [2.5 pages - DaveH]

[[Introduce area]]

[[Why is this important? How does it support the science case?]]


D.6.1 Online User Presence

Web Site

User Portal

Knowledgebase

Documentation

This objective ensures that users are provided with current, accurate information from across the TeraGrid in a dynamic environment of resources, software, and services.




D.6.2 Core Services (maybe there is a better name, but it is what we use)

Allocations, Accounting, Account Management



D.7 Data and Visualization [3 pages - KellyG]

[[Introduce area]]

[[Why is this important? How does it support the science case?]]


Other notes:

While there is a need for visualization resources, by this point in time the ANL visualization cluster will have been phased out of service. In addition, while Spur at TACC is already operational, this resource is partially (50%) associated with TACC's Track 2a award, and its staffing and O&M are covered from that budget. Given that two XD/Remote Visualization and Data Analysis awards will be made prior to the TG Extension period, we need to plan for their integration.



D.7.1.1 Visualization

Do we continue the Viz Gateway efforts? What has usage been thus far, and how long has the gateway been available to users? Are we providing general Viz user support here, or is that being done in AUS?


D.7.1.2 Lustre-WAN Deployment

In PY5, TG committed to a project-wide Lustre-WAN solution.

This is a new effort in the TG Extension: provide $100k of hardware each at PSC, NCSA, IU, NICS, and TACC; also provide 0.50 FTE each at PSC, NCSA, IU, NICS, TACC, and SDSC.

Deploy additional Lustre-WAN disk resources as part of a Lustre-WAN filesystem to be available on all resources continued into the TG Extension. We will also look at extending to the Track 2d awardee and XD/Remote Viz awardee resources as appropriate.


We will also support:

SDSC GPFS-WAN: 700 TB capacity; continue to provide support for data collections and participate in the archive data replication service project, potentially as a wide-area filesystem or high-speed data cache for transfers. If appropriate, hardware resources could be re-directed to participate in the TG-wide Lustre-WAN solution; alternatively, the capacity could move to Lustre-WAN to participate in the TG-wide Lustre-WAN solution while continuing to support data collections and the archive data replication service project, potentially as a wide-area filesystem supporting that service.

IU Lustre-WAN: 984 TB capacity; serve as a data resource for the TG-wide Lustre-WAN filesystem, and participate in the archive data replication service project, potentially as a wide-area filesystem supporting that service.


There has also been ongoing hope for an emerging pNFS solution, which now would potentially appear in the time period of the TG Extension. This will need to be taken into account in the scope of the proposal. For these resources, pNFS integration would need to be investigated if the technology shows promise in that time frame. We should also consider investigating other alternatives (e.g., PetaShare). Both of these would likely fall under “project”-type activities traditionally funded by the GIG.



For consideration, there is work at PSC, with initial development supported outside the NSF, on developing the ability to federate disparate file systems. The resultant “global” metadata server will function as the basis for what appears to be a single, federated file system. This will provide an amalgamated view of all files in the “federation”, which will be very useful for users, and will also provide the basis for “file locating policies”, which might well include making multiple, distributed copies of files. This would provide the “replicated, distributed archival” capability without requiring uniform file or archival systems. Initial testing will commence in 1-3 months. PSC would need support for turning it into a production-ready resource.
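As a simplified, hypothetical sketch of the idea (the class, policy, and site names are illustrative assumptions, not PSC's actual design), a global metadata server maps each logical path to its physical replicas across otherwise disparate file systems and flags files that violate a replication policy:

from dataclasses import dataclass, field

@dataclass
class FileRecord:
    logical_path: str                             # path users see in the federation
    replicas: list = field(default_factory=list)  # (site, physical_path) pairs

class GlobalMetadataServer:
    def __init__(self, min_replicas=2):
        self.catalog = {}                  # logical_path -> FileRecord
        self.min_replicas = min_replicas   # example file-locating policy

    def register(self, logical_path, site, physical_path):
        rec = self.catalog.setdefault(logical_path, FileRecord(logical_path))
        rec.replicas.append((site, physical_path))

    def locate(self, logical_path):
        """Return all known replicas, regardless of which file system holds them."""
        return self.catalog[logical_path].replicas

    def needs_replication(self):
        """Files that violate the policy and should get another distributed copy."""
        return [r.logical_path for r in self.catalog.values()
                if len(r.replicas) < self.min_replicas]

# Example: one logical file backed by two different underlying systems.
gms = GlobalMetadataServer()
gms.register("/federation/projects/abc/run1.dat", "PSC-Lustre", "/lustre/abc/run1.dat")
gms.register("/federation/projects/abc/run1.dat", "IU-HPSS", "/hpss/abc/run1.dat")
print(gms.locate("/federation/projects/abc/run1.dat"))
print(gms.needs_replication())   # empty: the two-copy policy is satisfied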





D.7.1.3 Archive Replication

This activity provides support to the user community and applications to make use of the replication service. The TG Archive proposal funds only hardware and replication service development. We will likely also need to look at performance and functionality issues.



D.7.1.4 DV - Data Movement Performance

Tools should be in place by the end of Yr5. The effort here is to maintain the tools, collect and analyze data, and feed that data back into QA. Or should this be turned over to QA?
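As a minimal sketch of the kind of analysis involved (the log format, field order, and threshold are assumptions for illustration, not the actual Yr5 tools), throughput measurements could be summarized per site pair and slow pairs flagged for QA follow-up:

import csv
import statistics

def summarize_transfers(log_path, min_mb_per_s=50.0):
    """Read (src, dst, bytes, seconds) rows and flag slow site pairs."""
    rates = {}   # (src, dst) -> list of MB/s samples
    with open(log_path, newline="") as f:
        for src, dst, nbytes, seconds in csv.reader(f):
            mb_per_s = float(nbytes) / 1e6 / float(seconds)
            rates.setdefault((src, dst), []).append(mb_per_s)

    report = []
    for pair, samples in sorted(rates.items()):
        median = statistics.median(samples)
        flag = "SLOW" if median < min_mb_per_s else "ok"
        report.append((pair, round(median, 1), flag))
    return report

# Example use, assuming a transfers.csv collected by the measurement tools:
# for (src, dst), median, flag in summarize_transfers("transfers.csv"):
#     print(src, "->", dst, median, "MB/s", flag)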




D.8 Network, Operations, and Security [4 pages - TonyR]



[[Introduce area]]

[[Why is this important? How does it support the science case?]]



D.8.1.1 Networking

This will include coordination of the networking working group on network maintenance, networking contracts and communication with sites wishing to connect to TeraGrid.

Network description.

Also supporting:

LONI: network connectivity

SDSC: network connectivity; LA Hub networking support

IU: network connectivity

Purdue: network connectivity

ORNL: network aggregation switch maintenance

PSC: network connectivity (3 months)



D.8.1.2 Operational Services

TOC Services

Continue the TOC and 800 number during the Extension. The TOC heavily leverages the NCSA 24x7 helpdesk service. We must be careful not to overlap with the User Support section on the HelpDesk, etc.

Operational Instrumentation (device tracking)

Support existing tools and take care of reporting activities? There will be a need to incorporate the new resources (T2D and XD-Hardware) into these tools.

Inca Monitoring

SDSC will continue to manage and maintain the Inca monitoring deployment on TeraGrid, including writing and updating Inca reporters (test scripts), configuring and deploying reporters to resources, archiving test results in a Postgres database, and displaying and analyzing reporter data in Web status pages. We will work with RP administrators to troubleshoot detected failures on their resources and make improvements to existing tests and/or their configuration on resources. In addition, we plan to write or wrap any new tests identified by TeraGrid working groups or CTSS kit administrators and deploy them to TeraGrid resources. We will modify the Web status pages as CTSS and other working group requirements change. SDSC will continue to upgrade the Inca deployment on TeraGrid with new versions of Inca (as new features are often driven by TeraGrid) and optimize performance as needed.
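As a rough sketch of the shape of such a test (it deliberately does not use the real Inca reporter API; the command checked and the output format are assumptions), a reporter-style script verifies a capability on a resource and emits a pass/fail result that a monitoring framework could archive:

import shutil
import subprocess
import time

def check_command(name, version_flag="--version"):
    """Report whether a required command exists and runs on this resource."""
    result = {"test": "command." + name, "timestamp": time.time()}
    path = shutil.which(name)
    if path is None:
        result.update(status="FAIL", detail=name + " not found in PATH")
        return result
    proc = subprocess.run([path, version_flag], capture_output=True, text=True)
    if proc.returncode == 0:
        first_line = (proc.stdout.strip().splitlines() or [""])[0]
        result.update(status="PASS", detail=first_line)
    else:
        result.update(status="FAIL", detail=proc.stderr.strip())
    return result

if __name__ == "__main__":
    # Example capability check; any required client tool could be substituted.
    print(check_command("gcc"))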


D.8.1.3 RP Operations

Several compute and data resources will be supported into the TG Extension period. We will need text from each RP to support this, and it will need to be woven into a complete story.


D.8.1.3.1 Compute Resources

Providing pure capacity is not a strong argument, given that (a) Ranger is into its production life and as yet appears to have some available resource, (b) Kraken is still growing (currently 166 TF, adding 450 TF any day and another 300 TF later this year) to provide a total of ~1 PF by year's end, and (c) the Track 2c system will come on line, presumably adding ~1 PF more, in early 2010. These three machines will represent more than 2.5 PF of computing capability and capacity. Currently TeraGrid provides a total of ~1.2 PF across all resources; most of this is provided by Ranger and Kraken.

Resources that are continued should provide a clearly defined benefit to the user community, either through direct provision of resources or by providing a platform for developing and enabling important new capabilities.

Retain the IA32-64 resources to provide a platform for supporting large-scale interactive and on-demand (including science gateway) use. We have been given clear indication from the user community, the Science Advisory Board, and review panels that we should put more effort into this area.

These resources will have the metascheduling CTSS kit installed and will be allocated as a single resource. We will also look at advanced reservation capabilities. (A sketch of one possible policy for selecting among the pooled clusters appears below, after the list of supported systems.)

These resources will also support work with OSG, not only to run traditional OSG-style jobs (i.e., single-node execution) on TG resources, but also to further explore interoperability and resource and technology sharing. At a minimum such jobs could backfill the schedule, but we might want to allow them “reasonable” priority, as opposed to how low-level parallel jobs are typically handled by scheduling policies on large systems today.

Finally, these systems will provide a transition platform for those coming from a university- or departmental-level resource and moving out into the larger national cyberinfrastructure resource suite. Typically such researchers are accustomed to using an Intel cluster, and this provides something familiar with which to expand their usage and to work on scalability and related issues. These researchers would not be restricted to taking this path and could jump straight to the Track 2 systems, but many have asked for this type of capability. By making use of these platforms in this way, we also alleviate the pressure of smaller jobs on the larger systems, which have been optimized in their configuration and operational policies to favor highly scalable applications.

Systems to be supported are:

Steele: 66 TF, 15.7 TB memory, 893-node Dell cluster w/ 130 TB disk

QueenBee: 51 TF, 5.3 TB memory, 668-node Dell cluster w/ 192 TB disk

Abe: 90 TF, 14.4 TB memory, 1200-node Dell cluster w/ 400 TB disk

Lonestar: 62 TF, 11.6 TB memory, 1460-node Dell cluster w/ 107 TB disk

Virtual machines could also be supported on these resources. We need to sort out whether they will be and, if so, on which resources.
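As the sketch promised above (a toy illustration only: the core counts, load snapshot, and selection rule are assumptions, not the metascheduling CTSS kit itself), a metascheduler treating the four clusters as one allocatable resource might route each job to the cluster with the most idle headroom that can fit it:

CLUSTERS = {
    # name: cores assumed available to the pool (illustrative figures)
    "Steele": 7144, "QueenBee": 5344, "Abe": 9600, "Lonestar": 5840,
}

def pick_cluster(cores_requested, busy_cores):
    """Choose the cluster with the most idle headroom that can fit the job.

    busy_cores: dict mapping cluster name -> cores currently busy or queued.
    """
    candidates = []
    for name, total in CLUSTERS.items():
        headroom = total - busy_cores.get(name, 0)
        if headroom >= cores_requested:
            candidates.append((headroom, name))
    if not candidates:
        raise RuntimeError("no cluster in the pool can currently fit this job")
    return max(candidates)[1]   # most idle capacity wins

# Example: route a 256-core gateway job given a snapshot of current load.
print(pick_cluster(256, {"Steele": 7000, "QueenBee": 3000,
                         "Abe": 9500, "Lonestar": 5800}))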


D.8.1.3.1.1 Unique Computing Resource

The Lincoln cluster provides a unique GPU-based computing resource at scale: 192 compute nodes (3 TB memory) plus 96 S1070 Tesla units (1.5 TB memory) in a Dell/NVIDIA cluster w/ 400 TB disk.

NCSA needs to provide text with supporting arguments.


D.8.1.3.2 Supporting Virtual Machines

An emerging need and very interesting area for investigation and evaluation is the use of VMs to support scientific calculations. Some groups are doing this now, and Quarry at IU already provides a VM hosting service that is increasingly widely used and unique within the TeraGrid. (Currently Quarry supports more than 18 VMs for 16 distinct users, many of which host gateway front-end services.) This also has connections to supporting OSG users, and we should have an effort in this area. We believe this is another viable usage modality for the four cluster resources noted above, along with Quarry at IU (7.1 TF, 112 HS21 Blades in an IBM e1350 BladeCenter Cluster with 266 TB GPFS disk).




D.8.1.3.3 Supporting the Track 2c Transition

Given that the Track 2c machine will likely come up very close to the end of funding for TeraGrid, a 3-month transition period should be supported for Pople/Cobalt users. To support this, we should provide 3 months of funding to operate Pople, a 384-socket SGI Altix 4700 w/ 150 TB disk.

PSC needs to provide 1-2 paragraphs with supporting arguments for this transition.





D.8.1.4 Security

Incident Response

TeraGrid Incident Response Team Leadership (Marsteller): Coordination of the weekly TeraGrid incident response calls, in which incident response personnel from all TeraGrid sites communicate current security attacks, assess threat levels for newly discovered vulnerabilities, and develop response plans to current threats.

Security Services

TeraGrid depends on a set of centralized, core services, including a mechanism for obtaining X.509 credentials for PKI authentication, single sign-on across resources provided by the MyProxy service, and the Kerberos realm for sign-on access to the TGUP.
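As a hedged illustration of how a user or portal obtains a short-lived credential from MyProxy before accessing resources (the server name is a placeholder, and the options used should be confirmed against the deployed MyProxy documentation), a thin wrapper around myproxy-logon might look like this:

import subprocess

def get_proxy(username, server="myproxy.example.org", hours=12,
              out_path="/tmp/x509up_proxy"):
    """Run myproxy-logon to obtain a delegated X.509 proxy certificate.

    Prompts interactively for the MyProxy passphrase.
    """
    cmd = [
        "myproxy-logon",
        "-s", server,          # MyProxy server holding the credential
        "-l", username,        # account name under which it is stored
        "-t", str(hours),      # requested proxy lifetime in hours
        "-o", out_path,        # where to write the delegated proxy
    ]
    subprocess.run(cmd, check=True)
    return out_path

# proxy = get_proxy("jdoe")
# GSI-enabled tools can then be pointed at the proxy via X509_USER_PROXY.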


Secure TG Access

How much work remains for this in the Extension? Yr5 said: “We will produce a TeraGrid identity management infrastructure that interoperates with campus cyberinfrastructure and other grids for a transparent user experience.”



D.8.1.5 Quality Assurance

This activity involves monitoring and internal reporting to maintain operational effectiveness. It will NOT be completely transitioned to XD/TAIS when that starts; XD/TAIS will act as an external audit function and will not cover internal monitoring and quality assurance.



D.8.1.6 Common User Environment

Users should be able to move relatively easily among TG systems and remain productive; too little coordination hampers user productivity by introducing unnecessary overhead and sources of error. However, diversity of resources is a strength of TG, so excessive, unnecessary coordination is an obstacle to scaling and to using each resource's specific abilities to the fullest, which will include learning some resource-specific tools and (possibly) policies.






D.9 Education, Outreach, and Training; and External Relations [3 pages - ScottL]

[[Introduce area]]

[[Why is this important? How does it support the science case?]]

[[NOTE: $1M added to this area for future allocation. ScottL needs to provide text for the proposal describing high-priority items likely to be funded and refer to the process by which decisions will be made (annual planning process).]]


D.9.1 Training

Training will focus on expanding the learning resources and opportunities for current and potential members of the TeraGrid user community by providing a broad range of live, synchronous, and asynchronous training opportunities. The goal is to prepare users to make effective use of TeraGrid resources and services to advance scientific discovery. A key objective is to make the learning opportunities known and accessible to all users.


D.9.2 Education and Outreach

TeraGrid has established a strong foundation in learning and workforce development efforts focused around computational thinking, computational science, and quantitative reasoning skills, preparing larger and more diverse generations motivated to pursue advanced studies and professional careers in science, technology, engineering, and mathematics (STEM) fields. The RPs have led, supported, and directly contributed to K-12, undergraduate, and graduate education programs across the country.

TeraGrid has been conducting a very aggressive outreach program to engage new communities in using TeraGrid resources and services. The impact of this can be seen in the number of new DAC (and now Startup and Education) accounts that have been established over the last few years. TeraGrid has been proactive about meeting people “where they live”: on their campuses, at their professional society meetings, and through sharing examples of successes achieved by their peers in utilizing TeraGrid resources. Programs include Campus Champions, Professional Society Outreach, EOT Highlights, and the EOT Newsletter.



D.9.3 Enhancing Diversity and Broader Impacts

It is not clear that we need a separate subsection on this, since broader impacts at least should be mentioned throughout the proposal. Diversity plans might be needed here.




D.9.4 External Relations [1.0 FTE @ UC]

To meet NSF, user, and public expectations, information about TeraGrid success stories, including science highlights, news releases, and other news stories, should be readily accessible via the TeraGrid website and distributed among news outlets that reach the scientific user community and the general public. Such outlets include iSGTW and HPCwire (and NSF OIC). This work also involves design and preparation of materials for the TeraGrid website, for conferences, and for other outreach activities.

External Relations covers science writing, including science impact stories and news releases, and support of distribution activities, including flyer design, conference presence materials, and TeraGrid web site planning and design. Proven ability in translating complex information into informative prose will contribute to the TeraGrid objective of making science impact stories accessible to the non-specialist public. Communication activities include design for the web site (and implementation of a content management system), design of outreach flyers, and multi-media design of video loops for SC conferences. External Relations will also publish the Science Highlights.



D.9.5 International Collaborations

How will TG be working to further develop and identify new international collaborations?



D.10 Organization, Management and Project Management [2.5 pages - JohnT, MattH, TimC]

D.10.1 TeraGrid Organization and Management

Organizational structure: PI, Deputy Director, TG Forum Chair, Director of Science, TG Forum, TG ADs, WG leads, etc.


D.10.2 Advisory Groups




D.10.3 Project Management and Budget

PM Working Group and coordination.

Financial Management



D.10.4 Personnel

Experienced leadership, dedicated to wisely managing the allotted funding while delivering reliable resources and outstanding service to users, will be a key factor in determining the impact.

Principal Investigator: Ian Foster. The principal investigator is … [[not sure how to deal with the fact that he is effectively PI in name only here….]]


D.10.4.1 Other Senior Personnel

We need to explain that we did not list co-PIs on the proposal, since all of the RP PIs play a role at this level and this is too large a number to list in FastLane. This should perhaps also be brought out in the management section above.