U.S. Department of Energy

Office of Science

New Opportunities for Data and Information Management:
Finding the Dots, Connecting the Dots, Understanding the Dots

Raymond L. Orbach
Director, Office of Science

2006 AAAS Annual Meeting
February 19, 2006
St. Louis, MO

U.S. Department of Energy’s Office of Science



Supports basic research that underpins DOE missions

Constructs and operates large scientific facilities for the U.S. scientific community
  Accelerators, synchrotron light sources, neutron sources

Seven Program Offices
  Advanced Scientific Computing Research (ASCR)
  Basic Energy Sciences (BES)
  Biological and Environmental Research (BER)
  Fusion Energy Sciences (FES)
  High Energy Physics (HEP)
  Nuclear Physics (NP)
  Workforce Development (WD)



The FY 2007 President’s Request for science funding is a 14.1% increase and sets the Office of Science on a path to doubling by 2016.

An historic opportunity for our country: a renaissance for U.S. science and continued global competitiveness.

[Figure: Office of Science Budget, Doubling from FY 2006 to FY 2016. Budget authority in as-spent dollars (billions) by fiscal year, 1995 to 2016, compared with the FY 1995 level plus inflation. The SC budget doubles to $7.2B in FY 2016 from $3.6B in FY 2006.]

Data Storage Funding

                                        FY 2006    FY 2007
Data Storage Funding, Including R&D     $34M       $37.6M
(ASCR+HEP+NP)

Current experiment and simulation data storage capacity for the Office of Science is about 100 petabytes and is expected to more than double by FY 2009.



Data Sources

Three Pillars of Scientific Discovery: Experiment, Theory, and Simulation

Two different kinds of very large data sets:

  Experimental data
    High energy physics, environment and climate observation data, biological mass spectrometry
    Data need to be retained for the long term

  Simulation data
    Astrophysics, climate, fusion, catalysis, QCD
    From computationally expensive large simulations
    Post-processing of data using quantum Monte Carlo, analytics and graphical analysis, perturbation theory, and molecular dynamics



PetaCache Project
HEP Data Analysis: Beyond Data Mining

BaBar Data Challenge:
  2 petabytes stored, 10-100 terabytes intense access/inquiry
  1-15 kilobytes (small) data objects
  Hundreds of users, thousands of batch jobs

PetaCache project (SLAC: David Leith and Richard Mount)
  Revolutionize access to huge datasets:
    First innovative solid-state disk as intermediate storage for HPC data searches
    100 times lower latency than disk
    At least 500 times faster throughput than disk
    Builds Feature Database structures to accelerate the retrieval of data

Expected Impact
  BaBar: From analyst’s idea to seeing the result, nine months becomes one day.
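
To see why latency dominates when the objects are small, the following back-of-the-envelope sketch estimates how long it would take to touch every object in the heavily accessed region once, paying one full access latency per object. The 5 ms disk latency is an assumed, illustrative figure that does not come from the talk; only the 10 terabyte region, the 15 kilobyte object size, and the 100-fold latency reduction are taken from the slide. Serial access is also a simplification, since real workloads run many jobs in parallel.

```python
# Back-of-the-envelope sketch; decimal units (1 TB = 10**12 bytes) are assumed.
TB = 10**12
KB = 10**3

def serial_access_seconds(dataset_bytes, object_bytes, latency_s):
    """Time to touch every object once if each access pays one full latency."""
    n_objects = dataset_bytes // object_bytes
    return n_objects * latency_s

ASSUMED_DISK_LATENCY_S = 5e-3   # assumption: ~5 ms per random disk access
LATENCY_REDUCTION = 100         # from the slide: 100-fold lower latency

hot_region = 10 * TB            # lower end of the 10-100 terabyte range
object_size = 15 * KB           # 15 kilobyte objects, as quoted on the slide

disk_s = serial_access_seconds(hot_region, object_size, ASSUMED_DISK_LATENCY_S)
cache_s = serial_access_seconds(
    hot_region, object_size, ASSUMED_DISK_LATENCY_S / LATENCY_REDUCTION
)

print(f"disk-resident:        {disk_s / 86400:.1f} days")
print(f"solid-state resident: {cache_s / 3600:.1f} hours")
```

Under these assumptions the transfer time of a 15 kilobyte object is negligible next to the seek, so the total scales almost linearly with latency.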


Sheer Volume of Data

Climate
  Now: 20-40 terabytes/year
  5 years: 5-10 petabytes/year

Fusion
  Now: 100 megabytes/15 min
  5 years: 1000 megabytes/2 min

Advanced Mathematics and Algorithms
  Huge dimensional space
  Combinatorial challenge
  Complicated by noisy data
  Requires high-performance computers

Providing Predictive Understanding
  Produce hydrogen-based energy
  Stabilize carbon dioxide
  Clean up and dispose of toxic waste

Finding the Dots
Understanding the Dots
Connecting the Dots in Science

ORNL: Nagiza Samatova
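
The climate and fusion figures on this slide are quoted in different units, which makes the growth hard to compare at a glance. The short sketch below converts them to common rates. It assumes decimal units (1 terabyte = 10**12 bytes) and a 365-day year; the helper names are illustrative only.

```python
MB = 10**6
TB = 10**12
PB = 10**15
SECONDS_PER_YEAR = 365 * 24 * 3600

def burst_rate_mb_per_s(megabytes, minutes):
    """Average rate for a data set of `megabytes` produced every `minutes`."""
    return megabytes / (minutes * 60)

def sustained_mb_per_s(bytes_per_year):
    """Yearly volume expressed as a constant rate in MB/s."""
    return bytes_per_year / SECONDS_PER_YEAR / MB

# Fusion: 100 MB per 15 min now, 1000 MB per 2 min in five years.
fusion_now = burst_rate_mb_per_s(100, 15)
fusion_5yr = burst_rate_mb_per_s(1000, 2)
fusion_growth = fusion_5yr / fusion_now

# Climate: 20-40 TB/year now, 5-10 PB/year in five years.
climate_now = (sustained_mb_per_s(20 * TB), sustained_mb_per_s(40 * TB))
climate_5yr = (sustained_mb_per_s(5 * PB), sustained_mb_per_s(10 * PB))
climate_growth = ((5 * PB) / (40 * TB), (10 * PB) / (20 * TB))

print(f"fusion:  {fusion_now:.2f} -> {fusion_5yr:.2f} MB/s (x{fusion_growth:.0f})")
print(f"climate: {climate_now[0]:.2f}-{climate_now[1]:.2f} -> "
      f"{climate_5yr[0]:.0f}-{climate_5yr[1]:.0f} MB/s "
      f"(x{climate_growth[0]:.0f} to x{climate_growth[1]:.0f})")
```

The fusion rate grows by a factor of 75 because the data set becomes ten times larger while the interval becomes 7.5 times shorter.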


Connecting the Dots in Combustion, Fusion, and Structural Biology

Finding the DOTS

Large-scale simulations in support of combustion grand challenges are generating terabytes of data per simulation. Of particular interest in these simulations are transient events such as ignition, extinction, and re-ignition, which are not well understood. Similar problems also exist in high-resolution, ultra-high-speed images of edge turbulence in the National Spherical Torus Experiment at PPPL. In structural biology, the interaction between two proteins forming a molecular machine can be described as the set of contacting amino acid residues. The set of features is very large, and is generated by the combinations of different chemical identities, orientation patterns, and spatial arrangements of the residues.


Connecting the DOTS

In combustion, it is unclear which features in the simulation data, and which of their nonlinear dynamic effects, could be used to characterize such events. Simulations need to be carried out to explore different possibilities. In fusion, extracting features that could characterize the plasma blobs is relevant to the analysis of Poincaré sections for the particle orbits. For the two interacting proteins, the number of distinctly different variants of subunits forming the molecular machine is in the millions or billions, even after applying sophisticated filtering algorithms. The correlations between the subunits establish the connection between the dots.


Understanding the DOTS

A complete understanding of the correlations and chemical reactions inherent in the turbulent flow during combustion is still beyond our reach.

In fusion, each particle orbit in a Poincaré section is generated when a particle intersects a plane perpendicular to the magnetic axis. Identifying and classifying the orbits is of significant importance in understanding and stabilizing the plasma.

Multiple connectable groups of amino acids can be constructed for the interacting proteins, with probabilities giving the likelihood for each variant. Finding the "optimal" solution is important. For example, high-scoring interfaces may represent a dynamic picture of the protein machine workings, or additional "ports" suitable for not-yet-discovered protein subunits and other co-factors.
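
The fusion item above describes a Poincaré section as the set of points where an orbit crosses a plane perpendicular to the magnetic axis. The sketch below illustrates that construction, and one very simple way of classifying orbits from it, under a toy model that is an assumption and not taken from the talk: the orbit stays on a circular surface of minor radius `r` and its poloidal angle advances by a fixed fraction `iota` of a full turn per toroidal circuit. Real plasma orbits are far more complicated; all names here are hypothetical.

```python
import math

def poincare_section(r, iota, n_turns, R0=1.0, theta0=0.0):
    """Puncture points (R, Z) of an idealized orbit on the plane phi = 0.

    Assumed toy model: the orbit stays on a circular surface of minor
    radius r, and its poloidal angle advances by 2*pi*iota per toroidal turn.
    """
    points = []
    for n in range(n_turns):
        theta = theta0 + 2.0 * math.pi * iota * n
        points.append((R0 + r * math.cos(theta), r * math.sin(theta)))
    return points

def count_distinct(points, tol=1e-6):
    """Count distinct puncture points; a small count signals a periodic orbit."""
    distinct = []
    for p in points:
        if all(math.hypot(p[0] - q[0], p[1] - q[1]) > tol for q in distinct):
            distinct.append(p)
    return len(distinct)

# Rational iota: the orbit closes on itself and revisits the same few points.
periodic = poincare_section(r=0.3, iota=1 / 3, n_turns=60)
# Irrational iota: every crossing lands on a new point of the same circle.
quasi_periodic = poincare_section(r=0.3, iota=(math.sqrt(5) - 1) / 2, n_turns=60)

print(count_distinct(periodic), count_distinct(quasi_periodic))
```

In this toy setting the number of distinct puncture points separates closed orbits from ones that gradually trace out a curve, which is the simplest version of the classification task the slide refers to.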



Office of Science Decadal Data Challenge

Mathematical and Computational Challenges and Needs

“Curse of Dimensionality”: Interpretation of high dimensional data

Challenges:
  Going beyond classical Bayesian theory of probabilistic quantification to address long-range and non-linear correlations between features in noisy data
  Mathematical description of complex geometric shapes in their spatial and temporal dimensions
  Enumeration and optimization of multivariate functions on complex graphs that describe relationships between identified features
  Low-rank approximations and generalized separation of variables to reduce the dimension without destroying information
  New harmonic and discrete mathematics and new algorithms for fast extraction of correlations and patterns
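
As an illustration of the low-rank and separation-of-variables item in the list above, the sketch below approximates a sampled two-variable function f(x, y) by a single product g(x)·h(y), found with plain power iteration for the leading singular triplet. The test function, grid sizes, and function names are assumptions chosen for the example; this is one textbook technique, not the method proposed in the talk.

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def rank_one(A, iters=200):
    """Leading singular triplet (sigma, u, v) of A by power iteration."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        u = matvec(A, v)
        u_norm = norm(u)
        u = [x / u_norm for x in u]
        v = matvec(At, u)
        sigma = norm(v)
        v = [x / sigma for x in v]
    return sigma, u, v

def relative_error(A, sigma, u, v):
    """Frobenius-norm error of sigma * u * v^T relative to A."""
    num = sum((A[i][j] - sigma * u[i] * v[j]) ** 2
              for i in range(len(A)) for j in range(len(A[0])))
    den = sum(a * a for row in A for a in row)
    return math.sqrt(num / den)

# Assumed test data: a separable function plus a small non-separable term,
# sampled on a 20 x 30 grid.
xs = [i / 19 for i in range(20)]
ys = [j / 29 for j in range(30)]
A = [[math.exp(-x) * (1 + y * y) + 0.01 * math.sin(5 * x * y) for y in ys]
     for x in xs]

sigma, u, v = rank_one(A)
err = relative_error(A, sigma, u, v)
stored_full = len(xs) * len(ys)        # numbers needed for the full table
stored_rank1 = len(xs) + len(ys) + 1   # numbers needed for sigma, u, v

print(f"relative error {err:.4f} using {stored_rank1} of {stored_full} numbers")
```

The relative error measures how much information the reduction discards; when it is too large, further product terms can be added, at a storage cost that still grows with the sum rather than the product of the grid sizes.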



Office of Science Response to the Data Challenge

The Office of Science will initiate a long-term research program to address the “Curse of Dimensionality.” Some of the elements of the research program are:

Bayesian Theory: New research to develop efficient ways for dealing with both local and long-range correlations between features, including Bayesian estimators to correctly estimate the simultaneous appearance of “striking” features at precisely defined locations, and mechanisms to incorporate partial analytical models to supplement missing statistics.

Mathematical description of complex geometric shapes: New research on the stochastic theory of shapes to classify geometric shapes in terms of stochastic models, which are essential for the rigorous comparisons needed for pattern discovery. We intend to develop high performance scalable algorithms for querying, searching, tracking, and reconstruction of high dimensional shapes from incomplete information.

Enumeration and optimization of multivariate functions on complex graphs: New research to develop efficient methodologies for the hierarchical enumeration of composite objects, including analytical methods for dynamically constraining the search space. We intend to develop optimization methods to deal with novel spaces formed by graphs of identified features (dots) and their relationships (connections). Such spaces typically have hundreds of variables and dimensions. Additionally, we intend to develop computational libraries to efficiently handle an enormous number of possible variants through construction of subgraph indexing schemes and efficient lookup methods.
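
The last element mentions subgraph indexing schemes and efficient lookup methods. One simple way such an index could work is sketched below: each graph of features (dots) and relationships (connections) is indexed by its labeled edges, so a query pattern only needs to be compared against graphs that contain all of its edges. The data, labels, and function names are hypothetical, and this inverted-index filter is a generic technique rather than the library design proposed in the talk.

```python
from collections import defaultdict

def edge_key(a, b):
    """Order-independent key for an undirected edge between two feature labels."""
    return (a, b) if a <= b else (b, a)

def build_index(graphs):
    """Map each labeled edge to the set of graph ids that contain it."""
    index = defaultdict(set)
    for gid, edges in graphs.items():
        for a, b in edges:
            index[edge_key(a, b)].add(gid)
    return index

def candidates(index, pattern_edges):
    """Graph ids containing every edge of the pattern.

    This is a necessary filter only: when labels can repeat within a graph,
    the surviving candidates still need a full subgraph-matching check.
    """
    sets = [index.get(edge_key(a, b), set()) for a, b in pattern_edges]
    return set.intersection(*sets) if sets else set()

# Hypothetical variants: nodes are feature labels, edges are relationships.
graphs = {
    "variant_1": [("ALA", "GLY"), ("GLY", "SER"), ("SER", "LYS")],
    "variant_2": [("ALA", "GLY"), ("GLY", "LYS")],
    "variant_3": [("GLY", "ALA"), ("SER", "GLY"), ("ASP", "LYS")],
}

index = build_index(graphs)
hits = candidates(index, [("GLY", "ALA"), ("GLY", "SER")])
print(sorted(hits))
```

The index is built once, and each lookup then costs a few set intersections instead of a scan over every variant, which is the property that matters when the number of variants is in the millions or billions.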