Rob Pennington - The Coalition for Academic Scientific Computation


Nov 20, 2013




Cyberinfrastructure for the 21st Century (CIF21): Data


MRI and STCI


EarthCube


CASC

Sept 9, 2011


Rob Pennington

Office of Cyberinfrastructure (OCI)

National Science Foundation

rpenning@nsf.gov



Framing the Challenge: Science and Society Transformed by Data

Modern science
Data- and compute-intensive
Integrative, multiscale

Multi-disciplinary Collaborations for Complexity
Individuals, groups, teams, communities

Sea of Data
Age of Observation
Distributed, central repositories, sensor-driven, diverse, etc.

Advisory Committee for Cyberinfrastructure Task Force Reports

Grand Challenges
Campus Bridging
Data and Viz
Cyberlearning
HPC (High Performance Computing)
Software

More than 25 workshops and Birds of a Feather sessions and more than 1300 people involved
Final recommendations presented to the NSF Advisory Committee on Cyberinfrastructure (ACCI) Dec 2010
Final reports on-line at: http://www.nsf.gov/od/oci/taskforces/

Data Task Force Recommendations

Infrastructure: Recognize data infrastructure and services (including visualization) as essential long-term research assets fundamental to today’s science

Economic sustainability: Develop realistic cost models to underpin institutional/national business plans for research repositories/data services

Culture Change: Emphasize expectations for data sharing; support the establishment of new citation models in which data and software tool providers and developers are recognized and credited with their contributions

Data Management Guidelines: Identify and share best practices for the critical areas of data management

Ethics and IP: Train researchers in privacy-preserving data access

Evolution of Cyberinfrastructure for the 21st Century (CIF21) and Data

ACCI Data Task Force
National Science Board (NSB)
DataNet Program
Community Input
NSF CIF21 Data Programs
On-going input

Science & Engineering Research + Cyberinfrastructure

Discovery
Collaboration
Education

Maintainability, sustainability, and extensibility

Cyberinfrastructure Ecosystem (CIF21)


Organizations
Universities, schools
Government labs, agencies
Research and Medical Centers
Libraries, Museums
Virtual Organizations
Communities

Expertise
Research and Scholarship
Education
Learning and Workforce Development
Interoperability and operations
Cyberscience

Networking
Campus, national, international networks
Research and experimental networks
End-to-end throughput
Cybersecurity

Computational Resources
Supercomputers
Clouds, Grids, Clusters
Visualization
Compute services
Data Centers

Data
Databases, Data repositories
Collections and Libraries
Data Access; storage, navigation, management, mining tools, curation, privacy

Scientific Instruments
Large Facilities, MREFCs, telescopes
Colliders, shake Tables
Sensor Arrays: Ocean, environment, weather, buildings, climate, etc.

Software
Applications, middleware
Software development and support
Cybersecurity: access, authorization, authentication


CIF21: Four Major Thrust Areas




Data-Enabled Science
New Computational Resources
Community Research Networks
Access and Connections to CI Resources

Education: integral and embedded

Scientific Data Challenges

[Figure: data volume in bytes per day, on a scale from gigabytes through terabytes and petabytes to exabytes, for 2012 and 2020. Labeled sources: Genomics; LHC; TeraGrid, Blue Waters; Climate, Environment; Square Kilometer Array; LSST. Other labels: Volume, Distribution, Data Access, "Many smaller datasets…"]
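
To give a rough sense of what the bytes-per-day scale on this slide implies for networks, a daily data volume can be converted into an average sustained transfer rate. The following is a minimal sketch in Python and is not part of the original slides; the function name sustained_gbps and the choice of decimal (SI) units are assumptions made purely for illustration.

```python
# Illustrative sketch only: converts a daily data volume into the average
# network rate needed to move that volume within the same day.
# Assumption: decimal (SI) units, e.g. 1 PB = 10**15 bytes.

SECONDS_PER_DAY = 24 * 60 * 60

BYTES_PER_UNIT = {
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def sustained_gbps(amount: float, unit: str) -> float:
    """Return the average rate in gigabits per second for `amount` `unit` per day."""
    bytes_per_day = amount * BYTES_PER_UNIT[unit]
    bits_per_second = bytes_per_day * 8 / SECONDS_PER_DAY
    return bits_per_second / 1e9


if __name__ == "__main__":
    for unit in ("GB", "TB", "PB", "EB"):
        print(f"1 {unit}/day ~ {sustained_gbps(1, unit):.4g} Gbit/s sustained")
```

Under these assumptions, one petabyte per day corresponds to roughly 93 Gbit/s of sustained throughput, and each step up the scale multiplies that figure by 1000.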

DataNet


Support data-intensive and multi-disciplinary science
Provide reliable digital access, integration, management and preservation capabilities for science and engineering data over a decades-long timeline
Develop innovative data analysis and mining tools to support data manipulation, modeling, and discovery
Engage at the frontiers of technological innovation and transformative science to drive the leading edge forward


CIF21 Data Goals


DataNet is a strategic part of Foundation-wide investments in data in CIF21
Focus on center-scale awards
DataNet efforts effectively balance:
Production infrastructure to provide operational services
Research to create next generation infrastructure
DataNet awards are partnerships
Responsive to user communities to define their meaningful and useful scope
Form a coordinated network to provide national, interdisciplinary data models and infrastructure

DataNet Role in CIF21

DataNet: A Multi-tiered and Multi-Disciplinary Landscape

Genomics Communities
Modeling and Simulation Communities
Population, Climate, Environment Communities

Data Curation
Data Storage
Data-enabled Science

DataNet supported


Data Storage

National storage infrastructure for scientific data
Accommodate scale and heterogeneity of scientific data through robust, open, and broadly accepted standards
Sustainable cost model that can be implemented with governmental, academic, nonprofit, and commercial stakeholders
Make strategic investments that:
Leverage existing resources in TeraGrid, commercial clouds, federal data centers
Meet growing capacity needs at optimum cost
Provide coordinating and integrative functions for integrity, access control, availability, persistence
Catalyze a national data infrastructure, in a role similar to the one NSFNet played for the Internet

Data Curation

Sustainable, community-based networks for management of critical scientific data resources in a life-cycle context.
Overcome challenges of culture change, policy development and implementation, sustainable operations, quality and usability control.
Strategic awards that address heterogeneity in formats, complexity, semantics of data collections that are valued by science communities of significant breadth.
Operate as a network of data services that promote interoperability, multidisciplinarity, and scalability.

Data Enabled Science

Provide critical tools and services for data mining, integration, analysis, modeling and visualization.
Overcome barriers to scaling, synthesis, and interoperability to promote effective use of large-scale, shared data resources.
Strategic investments that concentrate tools, resources and expertise in support of compelling grand challenge science questions.

Cross Cutting Challenges

Balancing research into next generations of infrastructure with operation & maintenance of current capacity.
Stimulate innovation and manage transitions
Sustainable, long-term programs
Technical design, development of business models, and integration with the research cycle.
Integration
Vertical: Linking low-level bit storage infrastructure to data collections, and finally to applications
Horizontal: Achieving connectivity and interoperability between activities that vary in scale, disciplinarity, and funding source.


Life cycle perspective covering the use of the data
Research, development, implementation, operations, sustainability, close-out
Apply project management methods
WBS, risk management, change control, schedule, milestones, deliverables
Standardized process:
Evaluate science merit, conceptual design
Develop draft PEP, design and reporting metrics.
Critical review: prototype, finalize baseline (approval/mid-course correction/off-ramp)
Implementation & operations: subject to change control, oversight based on milestones & metrics
Final operational review: informs decision for renewal, termination.

DataNet Program Management

DataNet Federation Consortium

Data Driven Science
Implement national data grid
Federate existing discipline-specific data management systems to enable national research collaborations
Enable collaborative research on shared data collections
Manage collection life cycle as the user community broadens
Integrate “live” research data into education initiatives
Enable student research participation through control policies

Collection Life Cycle
Project
Shared Collection
Processing Pipeline
Digital Library
Reference Collection
Federation

Cyber-infrastructure Partners:
Univ. of North Carolina, Chapel Hill
Univ. of California, San Diego
Arizona State University
Drexel University
Duke University
University of Arizona
University of South Carolina

Science and Engineering Initiatives:
Ocean Observatories Initiative
the iPlant Collaborative
CUAHSI
CIBER-U
Odum Social Science Institute
Temporal Dynamics of Learning Center

National Science Foundation Cooperative Agreement: OCI-0940841

Policy-based data management


CUNY SI: Instrumentation for Enabling Data Analysis, Sharing, Storage, and Preservation
UC Boulder: Acquisition of a Scalable Petascale Storage Infrastructure for Data-Collections and Data-Intensive Discovery
RPI: Acquisition of a Balanced Environment for Simulation
NCA&T: Acquisition of a Complete High-Performance Modeling and Visualization System for Research in Mathematical Biology and Mathematical Geosciences
OSU: Acquisition of a High Performance Compute Cluster for Multidisciplinary Research

MRI 2011

WHAT IS EARTHCUBE?

A Call to Action
Transitions and Tipping Points in Complex Environmental Systems, NSF AC for Environmental Research and Education, 2009
Earth Science and Applications from Space: National Imperatives for the Next Decade and Beyond, 2007
High-Performance Computing Requirements for the Computational Solid Earth Sciences, 2005

Goal of EarthCube
To transform the conduct of research in geosciences by supporting community-based cyberinfrastructure to integrate data and information for knowledge management across the Geosciences.

What Needs To Be Done?
Integrate data, tools and communities through cyberinfrastructure
Establish a governance mechanism that is inclusive and adopted by the community
Utilize current and emerging technologies to create transparent infrastructure for the geosciences community

Modes of Support

Convergence to a Unifying Architecture

EARTHCUBE ASSUMPTIONS

The geosciences community is ready to take on the EarthCube challenge
Community will start self-organizing prior to EarthCube activities, like the Nov 1-4 Charrette
Current and emerging technology will help achieve the convergence envisioned for EarthCube
A broad range of expertise and resources must be engaged to shape EarthCube


Jun 2011: DCL Released
Jul-Sept 2011: Two WebEx events
Nov 1-4, 2011: Charrette
Nov 2011-Apr 2012: Proposed Framework Approaches Developed through EAGERs
May 2012: Sandpit/IdeasLab to determine 18 mo. prototype award(s)

EARTHCUBE TIMELINE


On-line Community Information: August to November, 2011
EarthCube Charrette: Early November, 2011
EarthCube Ideas/Lab: Tentatively Early May, 2012
Prototype Development: May to December 2013
Fully integrated geosciences infrastructure: 2014-2022


Pre-Charrette Organization (August-September)

Second WebEx on Aug. 22
NSF seeks input from a wide range of sources:
Individuals, inst./org., representatives of scientific groups or communities
Facilities and managers of CI endeavors
Industry, Federal Labs., Federal Agencies, and International Partners
NSF will establish on-line resources and forums to:
Gather community inputs/requirements
Facilitate partnerships and collaborations
Encourage submission of approaches to the EarthCube design

Charrette Process

Stakeholders focus EarthCube Ideas and Activities
Plenary Sessions to:
discuss user requirements
refine approaches and designs for EarthCube
develop partnerships and new collaborations
Remote participation and real-time comments system will be available
Summary Session
Comments from NSF, facilitators, and participants on process
NSF provides guidance on post-Charrette activities

Questions?