U.S. Department of Energy
Office of Science
New Opportunities for Data and
Information Management:
Finding the Dots, Connecting the Dots,
Understanding the Dots
Raymond L. Orbach
Director,
Office of Science
2006 AAAS Annual Meeting
February 19, 2006
St. Louis, MO
U.S. Department of Energy’s
Office of Science
Supports basic research that underpins DOE
missions
Constructs and operates large scientific
facilities for the U.S. scientific community
Accelerators, synchrotron light sources, neutron sources
Seven Program Offices
Advanced Scientific Computing Research (ASCR)
Basic Energy Sciences (BES)
Biological and Environmental Research (BER)
Fusion Energy Sciences (FES)
High Energy Physics (HEP)
Nuclear Physics (NP)
Workforce Development (WD)
DOE Office of Science
The FY 2007 President’s Request for science
funding is a 14.1% increase and sets the Office of
Science on a path to doubling by 2016
An historic opportunity for our country: a renaissance for U.S. science and continued global competitiveness.
[Figure: Office of Science Budget, Doubling from FY 2006 to FY 2016. Horizontal axis: Fiscal Year, 1995 to 2016. Vertical axis: Budget Authority, As Spent Dollars in Billions, 0 to 7. Reference line: FY 1995 level plus inflation.]
SC budget doubles to $7.2B in FY 2016 from $3.6B in FY 2006
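As a rough check on what a ten-year doubling means year over year, the sketch below computes the constant annual growth rate implied by going from $3.6B to $7.2B. The assumption of a constant rate is ours, made for illustration; the slide does not say how the increase is distributed across years, and the function name is hypothetical.

```python
def implied_annual_growth(start, end, years):
    """Constant annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0

# Figures from the slide: $3.6B in FY 2006 doubling to $7.2B in FY 2016.
rate = implied_annual_growth(3.6, 7.2, 10)
print(f"Implied constant annual growth: {rate:.2%}")
```

Under this assumption the doubling corresponds to roughly 7.2% growth per year, so the 14.1% requested for FY 2007 would be a faster-than-average first step.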
Data Storage Funding
Data Storage Funding Including R&D (ASCR+HEP+NP):
FY 2006: $34M
FY 2007: $37.6M
Current experiment and simulation data storage capacity
for the Office of Science is about 100 petabytes and is
expected to more than double by FY 2009
Data Sources
Three Pillars of Scientific Discovery:
Experiment, Theory, and Simulation
Two different kinds of very large data sets:
Experimental data
High energy physics, environment and climate observation data, biological mass spectrometry
Data must be retained for the long term
Simulation data
Astrophysics, climate, fusion, catalysis, QCD
From computationally expensive large simulations
Post processing of data using quantum Monte
Carlo, analytics and graphical analysis,
perturbation theory, and molecular dynamics
PetaCache Project
HEP Data Analysis: Beyond Data Mining
BaBar Data Challenge:
• 2 petabytes stored, 10-100 terabytes intense access/inquiry
• 1-15 kilobytes (small) data objects
• Hundreds of users, thousands of batch jobs
PetaCache project (SLAC: David Leith and Richard Mount)
Revolutionize access to huge datasets:
• First innovative solid-state disk as intermediate storage for HPC data searches
• 100 times smaller latency than disk
• At least 500 times faster throughput than disk
• Builds Feature Database structures to accelerate the retrieval of data
Expected Impact
BaBar: From analyst’s idea to seeing the result, nine months becomes one day.
Sheer Volume of Data
Climate
Now: 20-40 Terabytes/year
5 years: 5-10 Petabytes/year
Fusion
Now: 100 Megabytes/15 min
5 years: 1000 Megabytes/2 min
Advanced Mathematics and Algorithms
Huge dimensional space
Combinatorial challenge
Complicated by noisy data
Requires high-performance computers
Providing Predictive Understanding
Produce hydrogen-based energy
Stabilize carbon dioxide
Clean up and dispose of toxic waste
Finding the Dots
Understanding the Dots
Connecting the Dots in Science
ORNL: Nagiza Samatova
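To put the climate and fusion volume figures on this slide on a common footing, the following sketch converts them to growth factors. It assumes decimal units (1 PB = 1000 TB) and treats the fusion numbers as average rates over the stated intervals; both are our assumptions, and the helper name is hypothetical.

```python
MB, TB, PB = 1e6, 1e12, 1e15  # decimal units (an assumption)

def average_rate(volume_bytes, interval_seconds):
    """Average data rate in bytes per second."""
    return volume_bytes / interval_seconds

# Fusion: 100 MB per 15 min now, 1000 MB per 2 min in 5 years.
fusion_now = average_rate(100 * MB, 15 * 60)
fusion_5yr = average_rate(1000 * MB, 2 * 60)
fusion_growth = fusion_5yr / fusion_now

# Climate: 20-40 TB/year now, 5-10 PB/year in 5 years.
climate_growth_low = (5 * PB) / (20 * TB)    # low end now vs. low end later
climate_growth_high = (10 * PB) / (40 * TB)  # high end now vs. high end later

print(f"Fusion average rate grows about {fusion_growth:.0f}x")
print(f"Climate yearly volume, low ends: about {climate_growth_low:.0f}x")
print(f"Climate yearly volume, high ends: about {climate_growth_high:.0f}x")
```

Read this way, both projections amount to growth of roughly two orders of magnitude within five years.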
Connecting the Dots in Combustion,
Fusion, and Structural Biology
Finding the DOTS: Large-scale simulations in support of combustion grand challenges are generating terabytes of data per simulation. Of particular interest in these simulations are transient events such as ignition, extinction, and re-ignition, which are not well understood. Similar problems also exist in high-resolution, ultra-high-speed images of edge turbulence in the National Spherical Torus Experiment at PPPL. In structural biology, the interaction between two proteins forming a molecular machine can be described as the set of contacting amino acid residues. The set of features is very large, and is generated by the combinations of different chemical identities, orientation patterns, and spatial arrangement of the residues.
Connecting the DOTS: In combustion, it is unclear what features in the simulation data and their nonlinear dynamic effects could be used to characterize such events. Simulations need to be carried out to explore different possibilities. In fusion, extracting features that could characterize the plasma blobs is relevant to the analysis of Poincaré sections for the particle orbits. For the two interacting proteins, the number of distinctly different variants of subunits forming the molecular machine is in the millions or billions, even after applying sophisticated filtering algorithms. The correlations between the subunits establish the connection between the dots.
Understanding the DOTS:
• A complete understanding of the correlations and chemical reactions inherent in the turbulent flow during combustion is still beyond our reach.
• In fusion, each point of a particle orbit in a Poincaré section is generated when the particle intersects a plane perpendicular to the magnetic axis. Identifying and classifying the orbits is of significant importance in understanding and stabilizing the plasma.
• Multiple connectable groups of amino acids can be constructed for the interacting proteins, with probabilities giving the likelihood for each variant. Finding the "optimal" solution is important. For example, high scoring interfaces may represent a dynamic picture of the protein machine workings, or additional "ports" suitable for not-yet-discovered protein subunits and other co-factors.
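The Poincaré section idea mentioned for fusion can be sketched in a few lines. The example below is illustrative only and is not taken from any fusion code: it assumes an orbit sampled as (x, y, z) points around a torus whose symmetry axis is z, uses the half-plane y = 0, x > 0 as the section, and locates each crossing by linear interpolation. The function names and the test orbit are hypothetical.

```python
import math

def poincare_section(points):
    """Return (R, Z) points where a sampled orbit crosses the half-plane
    y = 0, x > 0 with y increasing, using linear interpolation."""
    hits = []
    for (x0, y0, z0), (x1, y1, z1) in zip(points, points[1:]):
        if y0 < 0.0 <= y1:
            t = -y0 / (y1 - y0)        # fraction of the step at which y = 0
            x = x0 + t * (x1 - x0)
            z = z0 + t * (z1 - z0)
            if x > 0.0:
                hits.append((x, z))    # on the plane y = 0, so R equals x
    return hits

def sample_torus_orbit(turns, iota=0.3, major_radius=3.0,
                       minor_radius=0.5, dphi=0.01):
    """Hypothetical test orbit: a line winding around a circular torus."""
    n = math.ceil(2.0 * math.pi * turns / dphi)
    orbit = []
    for i in range(n + 1):
        phi = 0.1 + i * dphi           # toroidal angle
        theta = iota * phi             # poloidal angle
        big_r = major_radius + minor_radius * math.cos(theta)
        orbit.append((big_r * math.cos(phi), big_r * math.sin(phi),
                      minor_radius * math.sin(theta)))
    return orbit

hits = poincare_section(sample_torus_orbit(turns=20))
print(len(hits), "section points")
```

For this regular test orbit the section points all lie on one circle in the (R, Z) plane. Classifying real orbits means recognizing when the points instead form island chains or scatter chaotically, which is the harder problem the slide refers to.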
Office of Science
Decadal Data Challenge
Mathematical and Computational Challenges and Needs
“Curse of Dimensionality”: Interpretation of high dimensional data
Challenges:
• Going beyond classical Bayesian theory of probabilistic quantification to address long-range and non-linear correlations between features in noisy data
• Mathematical description of complex geometric shapes in their spatial and temporal dimensions
• Enumeration and optimization of multivariate functions on complex graphs that describe relationships between identified features
• Low rank approximations and generalized separation of variables to reduce the dimension without destroying information
• New harmonic and discrete mathematics and new algorithms for fast extraction of correlations and patterns
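The link between low rank approximation and separation of variables can be made concrete in the simplest case: a function f(x, y) = g(x) h(y) tabulated on a grid is a rank-one matrix, so m × n values reduce to m + n + 1 numbers. The sketch below recovers such a factorization by power iteration. It is a minimal pure-Python illustration under that assumption, not the method proposed in the talk, and it does not handle a zero matrix or a starting vector orthogonal to the leading singular vector.

```python
def rank_one_approximation(matrix, iterations=100):
    """Leading rank-one term sigma * u * v^T of a matrix, found by
    power iteration on A^T A. Returns (sigma, u, v) with unit u and v."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * cols
    for _ in range(iterations):
        # u = A v
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # v = A^T u, normalized to unit length
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = sum(x * x for x in u) ** 0.5
    u = [x / sigma for x in u]
    return sigma, u, v

# A separable table f(x, y) = g(x) * h(y) is exactly rank one.
g = [1.0, 2.0, 3.0]
h = [4.0, 5.0]
table = [[gi * hj for hj in h] for gi in g]

sigma, u, v = rank_one_approximation(table)
rebuilt = [[sigma * ui * vj for vj in v] for ui in u]
max_error = max(abs(rebuilt[i][j] - table[i][j])
                for i in range(len(g)) for j in range(len(h)))
print("largest reconstruction error:", max_error)
```

For data that are only approximately separable, the same idea extends to keeping a few leading terms rather than one, trading storage against the information discarded.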
Office of Science Response to
the Data Challenge
The Office of Science will initiate a long-term research program to address the “Curse of Dimensionality.” Some of the elements of the research program are:
Bayesian Theory: New research to develop efficient ways for dealing with both local and long-range correlations between features, including Bayesian estimators to correctly estimate the simultaneous appearance of “striking” features at precisely defined locations, and mechanisms to incorporate partial analytical models to supplement missing statistics.
Mathematical description of complex geometric shapes: New research on the stochastic theory of shapes to classify geometric shapes in terms of stochastic models, which are essential for the rigorous comparisons needed for pattern discovery. We intend to develop high performance scalable algorithms for querying, searching, tracking, and reconstruction of high dimensional shapes from incomplete information.
Enumeration and optimization of multivariate functions on complex graphs: New research to develop efficient methodologies for the hierarchical enumeration of composite objects, including analytical methods for dynamically constraining the search space. We intend to develop optimization methods to deal with novel spaces formed by graphs of identified features (dots) and their relationships (connections). Such spaces typically have hundreds of variables and dimensions. Additionally, we intend to develop computational libraries to efficiently handle an enormous number of possible variants through construction of subgraph indexing schemes and efficient lookup methods.
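One simple form a subgraph indexing scheme could take is an inverted index from labeled edges to the graphs that contain them. The sketch below is a hypothetical illustration of that idea, not a description of the libraries the program intends to build. Graphs are given as lists of undirected edges between feature labels, and a query returns the graphs containing every edge of a pattern. When labels can repeat within a graph this is only a necessary filter, and surviving candidates would still need a full subgraph match.

```python
from collections import defaultdict

def edge_key(a, b):
    """Canonical key for an undirected edge between two feature labels."""
    return (a, b) if a <= b else (b, a)

def build_edge_index(graphs):
    """Map each labeled edge to the set of graph ids that contain it."""
    index = defaultdict(set)
    for graph_id, edges in graphs.items():
        for a, b in edges:
            index[edge_key(a, b)].add(graph_id)
    return index

def candidate_graphs(index, pattern_edges):
    """Ids of graphs that contain every edge of the pattern."""
    candidate_sets = [index.get(edge_key(a, b), set()) for a, b in pattern_edges]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)

# Hypothetical variants: dots are features, edges are their relationships.
graphs = {
    "g1": [("A", "B"), ("B", "C")],
    "g2": [("A", "B"), ("C", "D")],
    "g3": [("B", "A"), ("C", "B"), ("C", "D")],
}
index = build_edge_index(graphs)
matches = candidate_graphs(index, [("B", "A"), ("B", "C")])
print(sorted(matches))
```

The index is built once, after which each query costs a few set intersections instead of a scan over every stored variant.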