CyberInfrastructure to Support Scientific Exploration and Collaboration





Dennis Gannon

(based on work with many collaborators, most notably Beth Plale)

School of Informatics

Indiana University


Overview


CyberInfrastructure for Virtual Organizations
in Computational Science


Science Portals and The Gateway Concept


Automating the Trail From Data to Scientific
Discovery


An Example in Depth: Mesoscale Storm Prediction


The challenge for the Individual Researcher


Connecting tools to on-line services


The Promise of the Future


Multicore, personal petabytes, and gigabit bandwidth

The Realities of Science in the U.S.


“Big Science” dominates the funding
hierarchy.


Why? It's important and easy to sell to Congress.


The NSF is investing in a vast network
of supercomputers to support big
science


The results are empowering a broad
range of scientific communities.


Where is the single investigator?


The Web has enabled democratization of
information access


Is there a similar path for access to
advanced computational resources?

Democratizing Access to Science


What is needed for an individual or a
small team to do large-scale science?


Access to data and the tools to analyze it
and transform it.


A means to publish not just the results
of a study but also a way to share the
path to discovery.


Where are the resources?


What we have now: TeraGrid


What is emerging?

The TeraGrid


The US National Supercomputer Grid


CyberInfrastructure composed of a set of resources
(compute and data) that provide common services for:


Wide-area data management and single sign-on user authentication


Distributed job scheduling and management (in the works)



Collectively:


1 petaflop


20 petabytes


Soon to triple.


Will add a petaflop each year.


But at a slower rate than Google,
eBay, and Amazon add resources.
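
To make "single sign-on plus distributed job scheduling" concrete from a tool builder's point of view, here is a minimal Python sketch. The GatewayClient, Site, and JobSpec names are invented for illustration; they are not the actual TeraGrid or Globus middleware APIs.

    # Hypothetical sketch: one credential, many compute sites.
    # Names (GatewayClient, Site, JobSpec) are invented for illustration;
    # they are not the real TeraGrid/Globus interfaces.
    from dataclasses import dataclass

    @dataclass
    class JobSpec:
        executable: str
        arguments: list
        cpus: int

    @dataclass
    class Site:
        name: str
        free_cpus: int

    class GatewayClient:
        def __init__(self, proxy_credential: str, sites: list):
            # A single proxy credential stands in for "single sign-on":
            # the user authenticates once and the gateway acts on their behalf.
            self.credential = proxy_credential
            self.sites = sites

        def submit(self, job: JobSpec) -> str:
            # Pick any site with enough free CPUs: a stand-in for the
            # distributed metascheduling the slide says is "in the works".
            for site in self.sites:
                if site.free_cpus >= job.cpus:
                    site.free_cpus -= job.cpus
                    return f"{site.name}:{job.executable}:queued"
            raise RuntimeError("no site has enough free CPUs")

    if __name__ == "__main__":
        client = GatewayClient("x509-proxy", [Site("bigred", 64), Site("tungsten", 512)])
        print(client.submit(JobSpec("wrf.exe", ["namelist.input"], 256)))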

TeraGrid Wide: Science Gateways


Science Portals


A Portal = a web-based home + personal
workspace + personal tools.


Web Portal Technology + Grid Middleware


Enables a community of researchers to:


Access shared resources (both data and
computation)


Collaborate on shared problem solving
in a common forum


TeraGrid Science Gateways


Allow the TeraGrid to be the back-end resource.

NEESGrid

Real-time access to earthquake shake-table experiments at remote sites.

BIRN


Biomedical Information

Geological Information Grid Portal

Renci Bio Portal

Providing access to biotechnology tools running on a back-end Grid.

- Leverages state-wide investment in bioinformatics

- Undergraduate & graduate education, faculty research

- Another portal soon: the National Evolutionary Synthesis Center

Nanohub: nanotechnology

X-Ray Crystallography

ServoGrid Portal

The LEAD Project

Predicting Storms


Hurricanes and tornadoes cause massive
loss of life and damage to property


The underlying physical systems involve highly
non-linear dynamics, so prediction is computationally
intense


Data comes from multiple sources


“Real-time” data derived from streams of sensor
observations


Archived in databases of past storms


Infrastructure challenges:


Data-mine radar instrument data for storms


Allocate supercomputer resources
automatically to run forecast simulations


Monitor results and retarget instruments.


Log provenance and metadata about
experiments for auditing.

Traditional Methodology

[Forecast pipeline diagram: Analysis/Assimilation (quality control, retrieval of unobserved quantities, creation of gridded fields) → Prediction/Detection (PCs to teraflop systems) → Product Generation, Display, Dissemination → End Users (NWS, private companies, students).]

STATIC OBSERVATIONS


Radar Data

Mobile Mesonets

Surface Observations

Upper-Air Balloons

Commercial Aircraft

Geostationary and Polar Orbiting
Satellite

Wind Profilers

GPS Satellites


The Process is Entirely Serial and Static (Pre-Scheduled):

No Response to the Weather!


The LEAD Vision: Enabling a new paradigm of scientific
exploration.

DYNAMIC OBSERVATIONS

Models and Algorithms Driving Sensors

The CS challenge:


* Build cyberinfrastructure services that provide
adaptability, scalability, availability, and usability.


* Create a new paradigm of meteorology research.


Building Experiments that Respond to the Future

Can we pose a scientific search and
discovery query that the cyberinfrastructure
executes as our agent?




In the LEAD case the query is data-driven, persistent, and
agile:


Weather data streams define the nature of the computation.


Mine the data streams and detect “interesting” features;
an event triggers a workflow scenario that may have been
waiting for months (sketched below).
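
As a rough illustration of that data-driven trigger pattern, the Python sketch below watches a stream of radar readings and fires a stored workflow request when a reflectivity threshold is crossed. The 55 dBZ threshold, the record fields, and the launch_workflow stub are assumptions for illustration, not LEAD's actual mining or trigger logic.

    # Sketch of a data-driven trigger: mine a stream, fire a waiting workflow.
    # The 55 dBZ threshold and the record fields are illustrative assumptions.
    from typing import Callable, Iterable

    def storm_detector(readings: Iterable[dict], threshold_dbz: float = 55.0):
        """Yield readings whose reflectivity suggests an 'interesting' feature."""
        for r in readings:
            if r["reflectivity_dbz"] >= threshold_dbz:
                yield r

    def run_agent(readings: Iterable[dict], launch_workflow: Callable[[dict], None]):
        """Persistently watch the stream; trigger the stored forecast workflow."""
        for event in storm_detector(readings):
            launch_workflow({"workflow": "wrf_forecast",
                             "center": (event["lat"], event["lon"]),
                             "trigger_time": event["time"]})

    if __name__ == "__main__":
        stream = [
            {"time": "20:00Z", "lat": 35.2, "lon": -97.4, "reflectivity_dbz": 42.0},
            {"time": "20:05Z", "lat": 35.3, "lon": -97.5, "reflectivity_dbz": 57.5},
        ]
        run_agent(stream, launch_workflow=lambda req: print("launching", req))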

The LEAD Gateway Portal


Supports three classes of users:


Meteorology research scientists & grad students.


Undergrads in meteorology classes


People who want easy access to weather data.

Go to:

http://www.leadproject.org

Gateway Components


A Framework for Discovery


Four basic components


Data Discovery


Catalogs and index services


The experiment


Computational workflow managing on-demand
resources


Data analysis and visualization


Data product preservation, automatic metadata
generation, and experimental data provenance



Data Search


Select a region, a time range, and the desired attributes.
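
One way to picture the data-search step is as a filter over catalogued metadata records, selecting by bounding box, time range, and attributes. The sketch below works on an in-memory list; the record fields are assumptions, and the real LEAD catalogs expose much richer query services.

    # Sketch of a catalog query by region, time range, and attributes.
    # Record fields (lat, lon, time, product) are illustrative assumptions.
    from datetime import datetime

    def search(catalog, bbox, start, end, **attrs):
        """Return records inside bbox, within [start, end], matching attrs."""
        lat_min, lat_max, lon_min, lon_max = bbox
        hits = []
        for rec in catalog:
            if not (lat_min <= rec["lat"] <= lat_max and lon_min <= rec["lon"] <= lon_max):
                continue
            if not (start <= rec["time"] <= end):
                continue
            if all(rec.get(k) == v for k, v in attrs.items()):
                hits.append(rec)
        return hits

    if __name__ == "__main__":
        catalog = [{"id": "nexrad-ktlx-001", "lat": 35.3, "lon": -97.3,
                    "time": datetime(2007, 5, 1, 20, 0), "product": "radar-level2"}]
        print(search(catalog, bbox=(30, 40, -100, -90),
                     start=datetime(2007, 5, 1), end=datetime(2007, 5, 2),
                     product="radar-level2"))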

Building Experiments


As users interact with the portal, they
are creating “experiments.”


An experiment is:


A collection of data (or desired data)


A set of analysis, transformation, or
prediction tasks


Defined by a workflow or a high-level query


A provenance document that encodes a
repeatable history of the experiment.
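
A small record type makes the definition above concrete; the field names below are illustrative, not the actual LEAD or myLEAD schema.

    # Sketch of an "experiment" record as described above.
    # Field names are illustrative, not the actual LEAD/myLEAD schema.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Experiment:
        name: str
        inputs: List[str]          # data products (or queries for desired data)
        tasks: List[str]           # analysis, transformation, or prediction steps
        workflow: str              # workflow document or high-level query
        provenance: List[dict] = field(default_factory=list)  # repeatable history

        def record(self, step: str, **details):
            """Append one provenance event so the experiment can be replayed."""
            self.provenance.append({"step": step, **details})

    if __name__ == "__main__":
        exp = Experiment("spring-forecast-01",
                         inputs=["nexrad-ktlx-001"],
                         tasks=["ADAS assimilation", "WRF forecast"],
                         workflow="wrf_forecast.xml")
        exp.record("launched", site="tungsten", cpus=160)
        print(exp.provenance)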

Portal: Experimental Data & Metadata Space


CyberInfrastructure extends the user’s
desktop to incorporate a vast data
analysis space.


As users go about doing scientific
experiments, the CI manages back-end
storage and compute resources.


The portal provides ways to explore,
search, and discover this data.


Metadata about experiments is largely
automatically generated and highly
searchable.


It describes the data object (the file) in
application-rich terms and provides the
URI of a data service that can resolve
an abstract unique identifier to a real,
on-line data “file.”
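
The last point, an abstract identifier that a data service resolves to a real on-line location, can be sketched as a simple lookup. The identifier scheme, the resolver table, and the gsiftp URL below are invented for illustration.

    # Sketch: resolve an abstract data identifier to a concrete on-line URL.
    # The identifier scheme and resolver table are illustrative assumptions.
    RESOLVER = {
        "lead:data/2007/ktlx/vol-0042":
            "gsiftp://datastore.example.org/lead/2007/ktlx/vol-0042.nc",
    }

    def resolve(abstract_id: str) -> str:
        """Map an application-level identifier to a real data location."""
        try:
            return RESOLVER[abstract_id]
        except KeyError:
            raise LookupError(f"no on-line copy registered for {abstract_id}")

    if __name__ == "__main__":
        print(resolve("lead:data/2007/ktlx/vol-0042"))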

Typical weather forecast runs as a workflow

[Workflow diagram: pre-processing components (arpssfc, arpstrn, Ext2arps-ibc, Ext2arps-lbc, 88d2arps, mci2arps, nids2arps), the ADAS assimilation and arps2wrf conversion, the WRF forecast model, and post-processing/visualization (wrf2arps, arpsplot, IDV viz). Input data include terrain data files, surface data files, ETA/RUC/GFS data, radar data (level II and level III), satellite data, and surface, upper-air, mesonet & wind-profiler data.]

~400 data products are consumed, produced, and transformed during the workflow lifecycle:

Pre-Processing

Assimilation

Forecast

Visualization
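
As a rough sketch of how such a workflow can be driven, the Python below records the forecast chain as a dependency graph and runs the steps in topological order. The edges are assumptions inferred from the component names (e.g. that the pre-processing converters feed ADAS, which feeds WRF via arps2wrf); the real LEAD workflows are described in workflow documents executed by a workflow engine, not in Python.

    # Sketch: the forecast chain as a dependency graph, run in topological order.
    # Edges are assumptions inferred from component names, not the exact LEAD DAG.
    from graphlib import TopologicalSorter

    DEPENDS_ON = {
        "ADAS":     ["arpssfc", "arpstrn", "Ext2arps-ibc", "88d2arps", "mci2arps", "nids2arps"],
        "arps2wrf": ["ADAS"],
        "WRF":      ["arps2wrf", "Ext2arps-lbc"],
        "wrf2arps": ["WRF"],
        "arpsplot": ["wrf2arps"],
        "IDV viz":  ["WRF"],
    }

    def run(step: str):
        print("running", step)   # stand-in for launching the real component

    if __name__ == "__main__":
        for step in TopologicalSorter(DEPENDS_ON).static_order():
            run(step)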

The Experiment Builder


A portal “wizard” that leads the user
through the set-up of a workflow.


Asks the user:


“Which workflow do you want to run?”


Once this is known, it can prompt the
user for the required input data sources.


Then it “launches” the workflow.
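
A console stand-in for that wizard flow is sketched below; the workflow names, required inputs, and prompts are placeholders, not the portal's actual interface.

    # Sketch of the Experiment Builder's question flow (console stand-in for
    # the portal wizard). Workflow names and required inputs are placeholders.
    WORKFLOWS = {
        "wrf_forecast": ["forecast region", "radar data source", "initialization time"],
        "adas_analysis": ["analysis region", "surface observations"],
    }

    def experiment_builder(ask=input):
        choice = ask(f"Which workflow do you want to run? {list(WORKFLOWS)}: ")
        inputs = {need: ask(f"  {need}: ") for need in WORKFLOWS[choice]}
        print("launching", choice, "with", inputs)

    if __name__ == "__main__":
        # Scripted answers so the sketch runs without a live user.
        answers = iter(["wrf_forecast", "central Oklahoma", "KTLX level II", "21Z"])
        experiment_builder(ask=lambda prompt: next(answers))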

Parameter Selection


Selecting the forecast region



Experience so far


First release to support “WxChallenge: the new collegiate
weather forecast challenge”


The goal: “forecast the maximum and minimum temperatures,
precipitation, and maximum sustained wind speeds for select U.S.
cities.


to provide students with an opportunity to compete against their
peers and faculty meteorologists at 64 institutions for honors as the
top weather forecaster in the nation.”


79 “users” ran 1,232 forecast workflows, generating 2.6 TBytes of
data.


Over 160 processors were reserved on Tungsten from 10 am to 8 pm
EDT/EST, five days each week.


National Spring Forecast


First use of user-initiated 2 km forecasts as part of that program; this
generated serious interest from the National Severe Storm Center.


Integration with the CASA project is scheduled for the final year of the
LEAD ITR.

Is TeraGrid the Only Enabler?


The web has evolved a set of information and service
“super nodes”:


Directories & indexes (Google, MS, Yahoo)


Transactional mosh pits (eBay, Facebook, Wikipedia)


Raw data and compute services (Amazon …)


Can we build the tools for scientific discovery on
this “private sector” grid?


Yes.


One CS student + one bio-informatician + Amazon
Storage Service + Amazon Compute Cloud = …

A Virtual Lab for Evolutionary Genomics


Data and databases live on S3


Computational tools run (on-demand) as services on EC2.


User composes workflows.


Result data and metadata visible to
user through desktop client.
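
A minimal sketch of that pattern, using today's boto3 SDK as a stand-in for the 2007-era tooling: the bucket name, AMI id, and the analysis command are placeholders, and real AWS credentials would be needed to run it.

    # Sketch of the virtual-lab pattern: data on S3, on-demand analysis on EC2.
    # Bucket name, AMI id, and the analysis command are placeholders.
    import boto3

    def stage_data(local_path: str, bucket: str, key: str):
        """Put an input data set where the compute nodes can reach it."""
        boto3.client("s3").upload_file(local_path, bucket, key)

    def launch_analysis(ami_id: str, bucket: str, key: str):
        """Start an on-demand instance that pulls the data and runs the tool."""
        user_data = (
            "#!/bin/bash\n"
            f"aws s3 cp s3://{bucket}/{key} /tmp/input.fasta\n"
            "run_genomics_tool /tmp/input.fasta > /tmp/result.txt\n"   # placeholder tool
            f"aws s3 cp /tmp/result.txt s3://{bucket}/results/output.txt\n"
        )
        ec2 = boto3.client("ec2")
        return ec2.run_instances(ImageId=ami_id, InstanceType="m5.large",
                                 MinCount=1, MaxCount=1, UserData=user_data)

    if __name__ == "__main__":
        stage_data("sequences.fasta", "my-evo-lab", "inputs/sequences.fasta")
        launch_analysis("ami-0123456789abcdef0", "my-evo-lab", "inputs/sequences.fasta")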

Validating Scientific Discovery


The Gateway is becoming part of
the process of science by being an
active repository of data
provenance


The system records each
computational experiment that a
user initiates


A complete audit trail of the experiment
or computation


Published results can include a link to
provenance information for
repeatability and transparency.


The Scientific Method is all about
repeatability of experiments


Are we there yet?

Almost


The provenance contains the workflow,
and if we publish it, the experiment can be re-run.


Are the same resources still available?


Not a necessary condition for validation


Has the data changed?


Another user can modify it:


Replace an analysis step with another


Test it on different data.
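
A sketch of what publishing and replaying the provenance could look like: the document names each step, its inputs, and its version, and a replay function re-runs the steps, optionally swapping an analysis step or the input data as described above. The document layout is an assumption, not the actual LEAD provenance schema.

    # Sketch: replay an experiment from its published provenance document.
    # The document layout is illustrative, not the actual LEAD provenance schema.
    PROVENANCE = {
        "experiment": "spring-forecast-01",
        "steps": [
            {"name": "ADAS", "inputs": ["obs-2007-05-01"], "version": "5.2.4"},
            {"name": "WRF",  "inputs": ["ADAS"],           "version": "2.2"},
        ],
    }

    def replay(prov, substitutions=None, data_overrides=None):
        """Re-run each recorded step, optionally swapping a step or its inputs."""
        substitutions = substitutions or {}
        data_overrides = data_overrides or {}
        for step in prov["steps"]:
            name = substitutions.get(step["name"], step["name"])
            inputs = [data_overrides.get(i, i) for i in step["inputs"]]
            print(f"re-running {name} (v{step['version']}) on {inputs}")

    if __name__ == "__main__":
        replay(PROVENANCE)                                   # exact repeat
        replay(PROVENANCE, substitutions={"ADAS": "3DVAR"})  # swap an analysis step
        replay(PROVENANCE, data_overrides={"obs-2007-05-01": "obs-2008-04-12"})  # new data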

The Future Experimental Testbed


In five years: multicore, personal
petabytes, and ubiquitous gigabit
bandwidth.


Much richer experimental capability on the
desktop: more of the computational work
can be downloaded to it.


Do we no longer need the massive
remote data/compute center?


Demand scales with capability.


But there is more.

Last Thought


Vastly improved capability for interactive
experimentation


Data exploration and visualization. Interacting
with hundreds of incoming data streams.


Tracking our path and exploring 100 possible
experimental scenarios concurrently.


Deep search agents


Discovering new data and new tools


Grab data: automatically fetch and analyze the
provenance and set up the workflow to be re-run.

Questions

The Realization in Software

[Architecture diagram: User Portal, Portal server, Data Catalog service, MyLEAD User Metadata catalog, MyLEAD Agent service, Data Management Service, Workflow Engine (driven by a workflow graph), Provenance Collection service, Application services, Compute Engine, Data Storage, and a Fault Tolerance & scheduler component, tied together by an Event Notification Bus.]
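
The diagram's components are loosely coupled through the event notification bus; the Python below is a minimal in-process publish/subscribe sketch of that pattern. The topic name and handlers are invented for illustration, and the real bus was a distributed notification service shared by separate web services, not an in-process object.

    # Minimal publish/subscribe sketch of the event notification bus that ties
    # the services together. Topic names and handlers are illustrative only.
    from collections import defaultdict

    class EventBus:
        def __init__(self):
            self.subscribers = defaultdict(list)

        def subscribe(self, topic: str, handler):
            self.subscribers[topic].append(handler)

        def publish(self, topic: str, message: dict):
            for handler in self.subscribers[topic]:
                handler(message)

    if __name__ == "__main__":
        bus = EventBus()
        # The provenance collector and the user's metadata catalog both listen
        # for workflow events published by the workflow engine.
        bus.subscribe("workflow.status", lambda m: print("provenance:", m))
        bus.subscribe("workflow.status", lambda m: print("myLEAD catalog:", m))
        bus.publish("workflow.status", {"experiment": "spring-forecast-01",
                                        "step": "WRF", "state": "finished"})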