TeraGrid Science Gateways


Oct 23, 2013 (3 years and 11 months ago)


Nancy Wilkins-Diehr

TeraGrid Area Director for Science Gateways

wilkinsn@sdsc.edu

University of Michigan CI Days,
November 2, 2010

Entrance gate to the “Big House”, Ann Arbor, MI

You’ve heard a lot about the TeraGrid

Here’s a one-slide recap of the resources


TeraGrid resources today include:

Tightly Coupled Distributed Memory Systems, 2 systems in the top 10 at top500.org
Kraken (NICS): Cray XT5, 99,072 cores, 1.03 Pflop
Ranger (TACC): Sun Constellation, 62,976 cores, 579 Tflop, 123 TB RAM

Shared Memory Systems
Cobalt (NCSA): Altix, 8 Tflop, 3 TB shared memory
Pople (PSC): Altix, 5 Tflop, 1.5 TB shared memory

Clusters with Infiniband
Abe (NCSA): 90 Tflops
Lonestar (TACC): 61 Tflops
QueenBee (LONI): 51 Tflops

Condor Pool (Loosely Coupled)
Purdue - up to 22,000 CPUs

Gateway hosting
Quarry (IU): virtual machine support

Visualization Resources
TeraDRE (Purdue): 48 node nVIDIA GPUs
Spur (TACC): 32 nVIDIA GPUs

Storage Resources
Wide area filesystems (Lustre, GPFS)
Archival storage
Data replication service


Source: Dan Katz, U Chicago

So how do Gateways fit into this?

Gateways are a natural result of the impact of the internet on worldwide communication and information retrieval

Implications on the conduct of science are still evolving
1980s: Early gateways, National Center for Biotechnology Information BLAST server, search results sent by email, still a working portal today
1989: World Wide Web developed at CERN
1992: Mosaic web browser developed
1995: “International Protein Data Bank Enhanced by Computer Browser”
2004: TeraGrid project director Rick Stevens recognized growth in scientific portal development and proposed the Science Gateway Program
Today: Web 3.0 and programmatic exchange of data between web pages

Simultaneous explosion of digital information
Growing analysis needs in many, many scientific areas
Sensors, telescopes, satellites, digital images, video, genome sequencers
#1 machine on Top500 today over 1000x more powerful than all combined entries on the first list in 1993




Only 18 years since the release of Mosaic!

vt100 in the 1980s and a login window on Ranger today


Why are gateways worth the effort?


Increasing range of expertise needed to tackle the most challenging scientific problems

How many details do you want each individual scientist to need to know?
PBS, RSL, Condor
Coupling multi-scale codes
Assembling data from multiple sources
Collaboration frameworks


#!/bin/sh
#PBS -q dque
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:02:00
#PBS -o pbs.out
#PBS -e pbs.err
#PBS -V
cd /users/wilkinsn/tutorial/exercise_3
../bin/mcell nmj_recon.main.mdl

+(
  &(resourceManagerContact="tg-login1.sdsc.teragrid.org/jobmanager-pbs")
   (executable="/users/birnbaum/tutorial/bin/mcell")
   (arguments=nmj_recon.main.mdl)
   (count=128)
   (hostCount=10)
   (maxtime=2)
   (directory="/users/birnbaum/tutorial/exercise_3")
   (stdout="/users/birnbaum/tutorial/exercise_3/globus.out")
   (stderr="/users/birnbaum/tutorial/exercise_3/globus.err")
)

=======

# Full path to executable
executable=/users/wilkinsn/tutorial/bin/mcell

# Working directory, where Condor-G will write
# its output and error files on the local machine.
initialdir=/users/wilkinsn/tutorial/exercise_3

# To set the working directory of the remote job, we
# specify it in this globus RSL, which will be appended
# to the RSL that Condor-G generates
globusrsl=(directory='/users/wilkinsn/tutorial/exercise_3')

# Arguments to pass to executable.
arguments=nmj_recon.main.mdl

# Condor-G can stage the executable
transfer_executable=false

# Specify the globus resource to execute the job
globusscheduler=tg-login1.sdsc.teragrid.org/jobmanager-pbs

# Condor has multiple universes, but Condor-G always uses globus
universe=globus

# Files to receive stdout and stderr.
output=condor.out
error=condor.err

# Specify the number of copies of the job to submit to the condor queue.
queue 1
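
The three job descriptions above (PBS, RSL, Condor-G) all encode the same handful of choices. As a rough illustration of what a gateway can hide from a scientist, the sketch below turns a few form-style fields into the PBS script shown earlier. The function name and parameter names are hypothetical and chosen only for this example; they are not part of any TeraGrid software.

```python
def render_pbs_script(workdir, executable, input_file,
                      nodes=1, procs_per_node=2,
                      walltime="00:02:00", queue="dque"):
    """Return the text of a PBS batch script for one run.

    All parameter names are illustrative assumptions; the directives
    mirror the ones in the hand-written script on the slide.
    """
    lines = [
        "#!/bin/sh",
        f"#PBS -q {queue}",
        f"#PBS -l nodes={nodes}:ppn={procs_per_node}",
        f"#PBS -l walltime={walltime}",
        "#PBS -o pbs.out",
        "#PBS -e pbs.err",
        "#PBS -V",
        f"cd {workdir}",
        f"{executable} {input_file}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduce the example script from the slide.
    print(render_pbs_script(
        workdir="/users/wilkinsn/tutorial/exercise_3",
        executable="../bin/mcell",
        input_file="nmj_recon.main.mdl",
    ), end="")
```

A gateway user would only ever see the input file and perhaps a run-length choice; the queue name, node layout and output file names stay behind the portal.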

Gateways democratize access to high-end resources

Almost anyone can investigate scientific questions using high-end resources
Not just those in the research groups of those who request allocations
Gateways allow anyone with a web browser to explore

Opportunities can be uncovered via Google
My then 11-year-old son discovered nanoHUB.org when his science class was studying Bucky Balls

Foster new ideas, cross-disciplinary approaches
Encourage students to experiment

But used in production too
Significant number of papers resulting from gateways including GridChem, nanoHUB
Scientists can focus on challenging science problems rather than challenging infrastructure problems


Today, there are approximately 35 gateways using the TeraGrid


This just in: 35% of TeraGrid users charging jobs (June-Sept, 2010) were gateway users!

Not just ease of use

What can scientists do that they couldn’t do previously?

Linked Environments for Atmospheric Discovery (LEAD) - radar data coupled with on demand computing
Large Synoptic Survey Telescope (LSST) - access to sky surveys
Ocean Observatories Initiative (OOI) - access to sensor data
PolarGrid - access to polar ice sheet data
SIDGrid - expensive datasets, analysis tools
GridChem - coupling multiscale codes



How would this have been done before gateways?


3 steps to connect a gateway to TeraGrid

Request an allocation
Only a 1 paragraph abstract required for up to 200k CPU hours

Register your gateway
Visibility on public TeraGrid page

Request a community account
Run jobs for others via your portal

Staff support is available!

www.teragrid.org/gateways
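
Running jobs for others through a community account means the gateway itself has to remember which portal user consumed which share of the allocation. The sketch below is one minimal way to picture that bookkeeping. The class, method and user names are hypothetical, and a real gateway would persist these records rather than keep them in memory.

```python
from collections import defaultdict


class CommunityAccount:
    """Toy ledger for one shared allocation used by many portal users.

    Hypothetical illustration only; not TeraGrid's accounting system.
    """

    def __init__(self, allocation_hours):
        self.allocation_hours = float(allocation_hours)
        self.usage_by_user = defaultdict(float)

    @property
    def remaining_hours(self):
        """CPU hours left after all recorded charges."""
        return self.allocation_hours - sum(self.usage_by_user.values())

    def charge(self, portal_user, cpu_hours):
        """Record cpu_hours against portal_user, refusing overdrafts."""
        if cpu_hours <= 0:
            raise ValueError("cpu_hours must be positive")
        if cpu_hours > self.remaining_hours:
            raise RuntimeError("community allocation exhausted")
        self.usage_by_user[portal_user] += cpu_hours


if __name__ == "__main__":
    # 200k CPU hours is the figure quoted for the 1 paragraph abstract.
    account = CommunityAccount(200_000)
    account.charge("alice", 1500)
    account.charge("bob", 500)
    print(dict(account.usage_by_user), account.remaining_hours)
```

Keeping per-user records like these is also what makes it possible to report how many individual scientists a gateway serves.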


Tremendous Opportunities Using the Largest Shared Resources - Challenges too!

What’s different when the resource doesn’t belong just to me?
Resource discovery
Accounting
Security
Proposal-based requests for resources (peer-reviewed access)
Code scaling and performance numbers
Justification of resources
Gateway citations

Tremendous benefits at the high end, but even more work for the developers


Potential impact on science is huge


Small number of developers can impact thousands of scientists


But need a way to train and fund those developers


When is a gateway appropriate?


Researchers using defined sets of tools in different ways


Same executables, different input


GridChem, CHARMM


Creating multi-scale workflows


Datasets


Common data formats


National Virtual Observatory


Earth System Grid


Some groups have invested significant efforts here


caBIG, extensive discussions to develop common terminology and formats


BIRN, extensive data sharing agreements


Difficult to access data/advanced workflows


Sensor/radar input


LEAD, GEON


How to get started?


Conduct a needs assessment


Should I build a gateway?


Can I use an existing gateway?


What problems am I trying to solve?


Not all gateways need high-end computing


Decide on a software approach


Recommended software at www.teragrid.org


Targeted effort by a few can benefit many


Could a pool of developers design gateways for different domain areas? Yes!


TeraGrid staff assistance


Expressed Sequence Tag (EST) Pipeline


Take raw genome data in the FASTA format and run a series of applications on it

RepeatMasker, PaCE, CAP3 and BLAST used to generate the final assembled output

Very variable run times (milliseconds to days)

EST Pipeline based on the SWARM Web Service that provides a web service interface to clients and also manages the bulk job submission using the Birdbath API to submit to Condor

2M jobs run in 49 hours, only a handful of failures

Workflow is configured using a PHP based gateway that allows users to upload input data and select programs to run



Source: Archit Kulshrestha, IU
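
To make the bulk-submission idea concrete, here is a small sketch of how uploaded FASTA data could be split into records and grouped into batches, with each batch becoming one job. This is an assumption-laden illustration, not the SWARM or Birdbath code: the function names are invented for the example, and the real pipeline's batching policy is not described on the slide.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def make_batches(records, batch_size):
    """Group records into lists of at most batch_size (one list per job)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    sample = ">est1\nACGT\nTTGA\n>est2\nGGCC\n>est3\nATAT\n"
    for number, batch in enumerate(make_batches(parse_fasta(sample), 2)):
        print(number, [header for header, _ in batch])
```

For scale, 2M jobs in 49 hours works out to a sustained average of roughly 11 jobs completing per second, which is why per-job overhead and batching matter for a pipeline like this.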

Cyberinfrastructure for Phylogenetic Research (CIPRES)

www.phylo.org

Enables large-scale phylogenetic reconstructions
Parallel “fastest in the west” versions of applications such as MrBayes, RAxML and GARLI
Easy to use graphical user interface

Over 800 users, June-Sept
27% of all active TG users!!
5M CPU hours awarded


Intellectual Merit:

The CIPRES portal is cited in at least 35 publications
This includes publications in Nature, PNAS, and Cell.

Highlights of scientific findings:

New Family Tree for Arthropoda: A team of scientists compared genetic sequences from 75 arthropod species and drew a new family tree for the most successful phylum of animals on Earth. This work represents an important advance in the century-old problem of arthropod evolution.

Genome Sequence of a Transitional Eukaryote: A group of scientists sequenced the genome of Naegleria gruberi, a single-cell organism that is a key transitional species between prokaryotes and eukaryotes. This work provides new insights into the origins of subcellular organelles.

Co-evolution of Beetles and Flowering Plants: A group of researchers studied the evolutionary history of angiosperms and the beetles that interact with them. The work provided compelling experimental evidence for the long-postulated co-evolution of these two symbiotic groups.


Source: Mark Miller, SDSC


Broad Impacts:

77% of all jobs have been submitted from locations in the USA. Submissions are received regularly from researchers at top-tier institutions such as Harvard, Yale, and Stanford.

Jobs are received regularly from academic institutions in 17 EPSCoR states.

Job submissions have been received from 34 countries on 5 continents.

At least 5 undergraduate classes are known to use the portal routinely. This is likely an underestimate (based on Web log patterns).

More than 45,000 jobs have been run on the Portal over its lifetime. Between Dec 1, 2010 and June 30, 2010, users ran 6,108 parallel jobs on the TeraGrid.


Source: Mark Miller, SDSC


Additional Gateways for Biology


www.teragrid.org/gateways


List of all TeraGrid gateways



Biodrugscore


RENCI Science Portal


Open Life Sciences Gateway


Robetta


Biodrugscore

www.biodrugscore.org


Derive and validate scoring functions
Create training sets using structural and binding data from multiple databases including PDBbind and PDBcal
Define the components of scoring functions by picking from among a list of pre-computed terms
Partial least-squares regression analysis
Validate scoring functions

Apply custom scoring functions for the ranking of chemical libraries that are pre-docked against a large set of binding cavities from the human proteome

If the receptor of interest is not available, biodrugscore makes it possible for users to dock libraries against their target on the TeraGrid using their own account.


NBCR

www.nbcr.net


Compute resources

Service projects
Quantum to Continuum Mechanics Tools
Data Analysis Tools for Molecular Sequences
Heart Modeling
Visualization and multi-scale modeling
Grid services and Telescience

Tools and downloads
40+ packages, databases, services


RENCI Science Portal


https://portal.renci.org/portal/


125 biology applications
From Antigenic to WordMatch and everything in between

RENCI Science Desktop
BlastMaster desktop



Open Life Sciences Gateway

http://lsgw.uc.teragrid.org


Bioinformatics applications and data collections

Portal access, direct Web services calls, workflows with Taverna

And now Google gadgets!
igoogle.google.com, “add stuff”, search for TeraGrid


Robetta

http://www.robetta.org



Protein structure prediction server
Rosetta code from the David Baker laboratory

Also available
RosettaAntibody Server
RosettaDesign Server
RosettaDock Server
Rosetta Commons
FoldIt
Rosetta@home
Human Proteome Folding Project


Linked Environments for Atmospheric Discovery (LEAD)

Providing tools that are needed to make accurate predictions of tornados and hurricanes


Meteorological data


Forecast models


Analysis and visualization tools


Data exploration and Grid workflow


Highlights: LEAD Inspires Students

Advanced capabilities regardless of location


A student gets excited about what he was able to do with LEAD

“Dr. Sikora: Attached is a display of 2-m T and wind depicting the WRF's interpretation of the coastal front on 14 February 2007. It's interesting that I found an example using IDV that parallels our discussion of mesoscale boundaries in class. It illustrates very nicely the transition to a coastal low and the strong baroclinic zone with a location very similar to Markowski's depiction. I created this image in IDV after running a 5-km WRF run (initialized with NAM output) via the LEAD Portal. This simple 1-level plot is just a precursor of the many capabilities IDV will eventually offer to visualize high-res WRF output. Enjoy!

Eric” (email, March 2007)



Community Climate System Model (CCSM)

Makes a world-leading, fully coupled climate model easier to use and available to a wide audience

Compose, configure, and submit CCSM simulations to the TeraGrid

Used in Purdue’s POL 520/EAS 591: Models in Climate Change Science and Policy
Semester-long projects, 100 year CCSM simulations, generate policy recommendations based on scientific, economic, and political models of climate change impacts


Analytical Ultracentrifugation

Emerging computational tool for the study of proteins


The Center for Analytical Ultracentrifugation of Macromolecular Assemblies, UT Health Sciences

Major advances in the characterization of proteins and protein complexes as a result of new instrumentation and powerful software

Monitoring the sedimentation of macromolecules in real time in the centrifugal field allows their hydrodynamic and thermodynamic characterization in solution

Observations are electronically digitized and stored for further mathematical analysis


http://uslims.uthscsa.edu/


Source: Modern analytical ultracentrifugation in protein science: A tutorial review, Wikipedia

UltraScan provides a comprehensive data analysis environment

Management of analytical ultracentrifugation data for single users or entire facilities
Support for storage, editing, sharing and analysis of data

HPC facilities used for 2-D spectrum analysis and genetic algorithm analysis
TeraGrid (~2M CPU hours used)
Technical University of Munich
Juelich Supercomputing Center

Portable graphical user interface
MySQL database backend for data management
Over 30 active institutions

TeraGrid advanced support
Fault tolerance, workflows, use of multiple TG resources, community account implementation


Social Informatics Data Grid

Collaborative access to large, complex datasets


SIDGrid is unique among social science data archive projects
Streaming data which change over time
Voice, video, images (e.g. fMRI), text, numerical (e.g. heart rate, eye movement)

Investigate multiple datasets, collected at different time scales, simultaneously


Large data requirements


Sophisticated analysis tools


http://www.ci.uchicago.edu/research/files/sidgrid.mov

Viewing multimodal data like a symphony conductor

“Music-score” display and synchronized playback of video and audio files
Pitch tracks
Text
Head nods, pause, gesture references

Central archive of multi-modal data, annotations, and analyses
Distributed annotation efforts by multiple researchers working on a common data set
History of updates

Computational tools
Distributed acoustic analysis using Praat
Statistical analysis using R
Matrix computations using Matlab and Octave


Source: Studying Discourse and Dialog with SIDGrid, Levow, 2008

Future Technical Areas


Web technologies change fast


Must be able to adapt quickly


Gateways and gadgets


Gateway components incorporated into any social networking page
75% of 18 to 24 year-olds have social networking websites
iPhone apps?

Web 3.0
Beyond social networking and sharing content
Standards and querying interfaces to programmatically share data across sites
Resource Description Framework (RDF), SPARQL
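
RDF models data as subject-predicate-object triples, and SPARQL queries match patterns against those triples. The sketch below imitates that idea with a plain Python list so the mechanics are visible. It is a toy pattern matcher, not an RDF library or a SPARQL engine; the predicate names ("domain", "url") and the `ex:` prefix in the comment are made up for this example, while the gateway names and URLs come from the slides.

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
TRIPLES = [
    ("CIPRES", "domain", "phylogenetics"),
    ("CIPRES", "url", "www.phylo.org"),
    ("Robetta", "domain", "protein structure prediction"),
    ("Robetta", "url", "http://www.robetta.org"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Yield triples that agree with every field that is not None.

    None plays the role of a query variable such as ?gateway.
    """
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


if __name__ == "__main__":
    # Roughly the shape of the SPARQL query:
    #   SELECT ?gateway WHERE { ?gateway ex:domain "phylogenetics" }
    for gateway, _, _ in match(TRIPLES, predicate="domain",
                               obj="phylogenetics"):
        print(gateway)
```

The point of shared standards is that a second site could run the same kind of query against a gateway's published triples without any site-specific code.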


Gateways can further investments in other projects


Increase access


To instruments, expensive data collections


Increase capabilities


To analyze data


Improve workforce development


Can prepare students to function in today’s cross-disciplinary world


Increase outreach


Increase public awareness


Public sees value in investments in large facilities


Pew 2006 study indicates that half of all internet users have been to a site specializing in science
Those who seek out science information on the internet are more likely to believe that scientific pursuits have a positive impact on society


But gateways can only be truly effective if they are persistent

Gateway Sustainability Study


Characteristics of short funding cycles


Build exciting prototypes with input from scientists
Work with early adopters to extend capabilities
Tools are publicized, more scientists interested
Funding ends
Scientists who invested their time to use new tools are disillusioned
Less likely to try something new again
Start again on new short-term project

Need to break this cycle
EAGER grant to look at characteristics of successful gateways and domain areas where a gateway could have a big impact


Working with Katherine Lawrence, UM


4 focus group meetings over 2 years

First 2 held June, 2010


www.sciencegateways.org



Thank you for your attention!

Questions?




Nancy Wilkins-Diehr, wilkinsn@sdsc.edu

www.teragrid.org