U.S. Department of Energy Office of Science
U.S. Department of Energy’s
Office of Science
New Opportunities for Data and
Information Management:
Finding the Dots, Connecting the Dots,
Understanding the Dots
2006 AAAS Annual Meeting
February 19, 2006
St. Louis, MO
Raymond L. Orbach
Director,
Office of Science
U.S. Department of Energy
DOE Office of Science
Supports basic research that underpins DOE missions
Constructs and operates large scientific facilities for the U.S. scientific community
• Accelerators, synchrotron light sources, neutron sources
Seven Program Offices:
Advanced Scientific Computing Research (ASCR)
Basic Energy Sciences (BES)
Biological and Environmental Research (BER)
Fusion Energy Sciences (FES)
High Energy Physics (HEP)
Nuclear Physics (NP)
Workforce Development (WD)
The FY 2007 President’s Request for science
funding is a 14.1% increase and sets the Office of
Science on a path to doubling by 2016
Office of Science Budget: Doubling from FY 2006 to FY 2016
[Chart: Budget Authority, As Spent Dollars in Billions (axis 0 to 7), by Fiscal Year, 1995 to 2016, with a series labeled "FY 1995 level plus inflation".]
An historic opportunity for our country: a renaissance for U.S. science and continued global competitiveness.
SC budget doubles to $7.2B in FY 2016 from $3.6B in FY 2006
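As a rough check on the doubling figure, the sketch below works out the constant yearly growth rate that would take the budget from $3.6B in FY 2006 to $7.2B in FY 2016. Smooth compounding is an assumption made here for illustration; the slide does not say how the increase is spread across years, and the function name is hypothetical.

```python
# Implied constant annual growth rate for a budget that doubles in ten years.
# Dollar figures come from the slide; smooth compounding is an assumption.

def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1.0 / years) - 1.0

rate = implied_annual_growth(3.6, 7.2, 2016 - 2006)
print(f"Implied average growth: {rate:.2%} per year")  # about 7.18% per year
```

Under that assumption the average is a little over 7% per year, so the 14.1% FY 2007 request would sit above the long-run average pace.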
Data Storage Funding
Data Storage Funding, Including R&D (ASCR+HEP+NP)
FY 2006: $34M
FY 2007: $37.6M
Current experiment and simulation data storage capacity
for the Office of Science is about 100 petabytes and is
expected to more than double by FY 2009
Data Sources
Three Pillars of Scientific Discovery:
Experiment, Theory, and Simulation
Two different kinds of very large data sets:
Experimental data
• High energy physics, environment and climate observation data, biological mass-spectrometry
• Data needs to be retained for the long term
Simulation data
• Astrophysics, climate, fusion, catalysis, QCD
• From computationally expensive large simulations
• Post-processing of data using quantum Monte Carlo, analytics and graphical analysis, perturbation theory, and molecular dynamics
PetaCache Project
HEP Data Analysis: Beyond Data Mining
BaBar Data Challenge:
• 2 petabytes stored, 10-100 terabytes intense access/inquiry
• 1–15 kilobytes (small) data objects
• Hundreds of users, thousands of batch jobs
PetaCache project (SLAC: David Leith and Richard Mount)
Revolutionize access to huge datasets:
• First innovative solid-state disk as intermediate storage for HPC
data searches
• 100 times smaller latency than disk
• At least 500 times faster throughput than disk
• Builds Feature Database structures to accelerate the retrieval of
data
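To see why latency matters so much for many small objects, here is a minimal latency-bound model: one random access per object, performed serially. The 10 TB working set, the 15 KB object size, and the 100x ratio are taken from the ranges on the slide; the absolute disk latency of 5 ms is a hypothetical value chosen only for illustration, and the function names are likewise hypothetical. The model ignores parallel users, batching, and throughput limits.

```python
# Latency-bound model of touching many small objects, one random access each.
# ASSUMPTION: DISK_LATENCY_S is a hypothetical figure for illustration only;
# the slide gives the 100x ratio but no absolute latency.
DISK_LATENCY_S = 5e-3
LATENCY_RATIO = 100  # from the slide: "100 times smaller latency than disk"

def object_count(dataset_bytes: float, object_bytes: float) -> int:
    """Number of whole objects of a given size in a data set."""
    return int(dataset_bytes // object_bytes)

def serial_access_days(n_objects: int, latency_s: float) -> float:
    """Days needed to touch every object once, one access at a time."""
    return n_objects * latency_s / 86_400

n = object_count(10e12, 15e3)  # 10 TB working set, 15 KB objects
disk_days = serial_access_days(n, DISK_LATENCY_S)
cache_days = serial_access_days(n, DISK_LATENCY_S / LATENCY_RATIO)
print(f"{n} objects: {disk_days:.1f} days on disk, {cache_days:.2f} days cached")
```

With these assumed numbers a serial pass drops from weeks to hours. The point of the sketch is only that access time scales with object count times latency when the objects are this small.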
Expected Impact
BaBar: From analyst’s idea to seeing the result – nine months
becomes one day.
Connecting the Dots in Science
ORNL: Nagiza Samatova
Finding the Dots
Sheer Volume of Data
Climate
Now: 20-40 Terabytes/year
5 years: 5-10 Petabytes/year
Fusion
Now: 100 Megabytes/15 min
5 years: 1000 Megabytes/2 min
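The volume figures above are easier to compare once they are put in common units. The sketch below converts the fusion numbers to megabytes per minute and computes the range of growth factors implied for climate data. It assumes decimal units (1 petabyte = 1000 terabytes); the variable and function names are hypothetical.

```python
# Put the data-volume figures from the slide into common units.

def rate_mb_per_min(megabytes: float, minutes: float) -> float:
    """Average data rate in megabytes per minute."""
    return megabytes / minutes

fusion_now = rate_mb_per_min(100, 15)    # 100 MB every 15 minutes
fusion_later = rate_mb_per_min(1000, 2)  # 1000 MB every 2 minutes
fusion_growth = fusion_later / fusion_now

# Climate: 20-40 TB/year now, 5-10 PB/year in five years.
# ASSUMPTION: decimal units, 1 PB = 1000 TB.
climate_growth_low = (5 * 1000) / 40    # low future value over high present value
climate_growth_high = (10 * 1000) / 20  # high future value over low present value

print(f"Fusion rate grows about {fusion_growth:.0f}x")
print(f"Climate volume grows {climate_growth_low:.0f}x to {climate_growth_high:.0f}x")
```

On these figures the fusion data rate grows by roughly a factor of 75, and the yearly climate volume by a factor of 125 to 500.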
Understanding the Dots
Advanced Mathematics and Algorithms
• Huge dimensional space
• Combinatorial challenge
• Complicated by noisy data
• Requires high-performance computers
Providing Predictive Understanding
• Produce hydrogen-based energy
• Stabilize carbon dioxide
• Clean up and dispose of toxic waste
Connecting the Dots in Combustion,
Fusion, and Structural Biology
Finding the DOTS - Large-scale simulations in support of combustion grand challenges are generating terabytes of data per
simulation. Of particular interest in these simulations are transient events such as ignition, extinction, and re-ignition, which are not
well understood. Similar problems also exist in high-resolution, ultra-high speed images of edge turbulence in the National Spherical
Torus Experiment at PPPL. In structural biology, the interaction between two proteins forming a molecular machine can be described
as the set of contacting amino acid residues. The set of features is very large, and is generated by the combinations of different
chemical identities, orientation patterns, and spatial arrangement of the residues.
Connecting the DOTS – In combustion, it is unclear what features in the simulation data and their nonlinear dynamic effects could be
used to characterize such events. Simulations need to be carried out to explore different possibilities. In fusion, extracting features
that could characterize the plasma blobs is relevant to the analysis of Poincaré sections for the particle orbits. For the two interacting
proteins, the number of distinctly different variants of subunits forming the molecular machine is in the millions or billions, even after
applying sophisticated filtering algorithms. The correlations between the subunits establish the connections between the dots.
Understanding the DOTS –
• A complete understanding of the correlations and chemical reactions inherent in the turbulent flow during combustion is still beyond our
reach.
• In fusion, each particle orbit in a Poincaré section is generated when a particle intersects a plane perpendicular to the magnetic axis.
Identifying and classifying the orbits is of significant importance in understanding and stabilizing the plasma.
• Multiple connectable groups of amino acids can be constructed for the interacting proteins, with probabilities giving the likelihood for
each variant. Finding the "optimal" solution is important. For example, high-scoring interfaces may represent a dynamic picture of the
protein machine's workings, or additional "ports" suitable for as-yet-undiscovered protein subunits and other co-factors.
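The Poincaré section described in the fusion bullet can be illustrated with a toy example: sample an orbit, detect each time it crosses a chosen plane in one direction, and record where it crossed. The orbit below is a simple curve winding around a torus, standing in for a real particle trajectory. The torus dimensions, the winding parameter `iota`, and all function names are assumptions made for this sketch, not values from the National Spherical Torus Experiment.

```python
import math

def toy_orbit(n_steps, dt, R0=3.0, r=1.0, iota=0.3):
    """Sample a curve winding on a torus; a stand-in for a real particle orbit."""
    pts = []
    for k in range(n_steps):
        phi = k * dt        # toroidal angle, the long way around
        theta = iota * phi  # poloidal angle, the short way around
        rho = R0 + r * math.cos(theta)
        pts.append((rho * math.cos(phi), rho * math.sin(phi), r * math.sin(theta)))
    return pts

def poincare_section(points):
    """Return (x, z) where the orbit crosses the half-plane y = 0, x > 0 with y increasing."""
    hits = []
    for (x0, y0, z0), (x1, y1, z1) in zip(points, points[1:]):
        if y0 < 0.0 <= y1 and x0 > 0.0 and x1 > 0.0:
            s = -y0 / (y1 - y0)  # linear interpolation fraction to reach y = 0
            hits.append((x0 + s * (x1 - x0), z0 + s * (z1 - z0)))
    return hits

orbit = toy_orbit(n_steps=20_000, dt=0.01)
section = poincare_section(orbit)
print(len(section), "crossings; first:", section[0])
```

For this toy orbit every recorded point falls on a circle of radius `r` centered at `x = R0`, which is the kind of regular pattern a classifier would label as a well-behaved orbit. Real orbit data would need a proper integrator and a more careful crossing test.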
Office of Science
Decadal Data Challenge
Mathematical and Computational Challenges and Needs
“Curse of Dimensionality” - Interpretation of high dimensional data
Challenges:
• Going beyond classical Bayesian theory of probabilistic quantification to address long-range and non-linear correlations between features in noisy data
• Mathematical description of complex geometric shapes in their spatial and temporal dimensions
• Enumeration and optimization of multivariate functions on complex graphs that describe relationships between identified features
• Low-rank approximations and generalized separation of variables to reduce the dimension without destroying information
• New harmonic and discrete mathematics and new algorithms for fast extraction of correlations and patterns
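As one small illustration of the low-rank idea in the list above, the sketch below computes a rank-1 approximation of a matrix by power iteration and measures how much is lost. This is a textbook technique used here as a stand-in; the slides do not say which methods the program would use. The example matrix and the function names are hypothetical, and the iteration assumes the starting vector is not orthogonal to the dominant direction.

```python
import math

def rank1_approx(A, iters=200):
    """Rank-1 approximation sigma * u * v^T of A via power iteration on A^T A."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = A v
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v

def relative_error(A, sigma, u, v):
    """Frobenius-norm error of the rank-1 reconstruction, relative to A."""
    err = sum((A[i][j] - sigma * u[i] * v[j]) ** 2
              for i in range(len(A)) for j in range(len(A[0])))
    total = sum(x * x for row in A for x in row)
    return math.sqrt(err / total)

# Hypothetical data: rows are nearly multiples of [1, 2, 3], plus small noise.
A = [[1.0, 2.0, 3.0], [2.0, 4.1, 6.0], [3.0, 6.0, 8.9], [4.0, 8.0, 12.0]]
sigma, u, v = rank1_approx(A)
rel_err = relative_error(A, sigma, u, v)
print(f"relative error of rank-1 approximation: {rel_err:.4f}")
```

Here 12 entries are summarized by 8 numbers (one scale, a 4-vector, and a 3-vector) with a relative error under one percent, which is the sense in which dimension is reduced without destroying much information.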
Office of Science Response to
the Data Challenge
The Office of Science will initiate a long-term research program to address the “Curse
of Dimensionality.” Some of the elements of the research program are:
Bayesian Theory – New research to develop efficient ways for dealing with both local and long-range
correlations between features, including Bayesian estimators to correctly estimate the simultaneous
appearance of “striking” features at precisely defined locations, and mechanisms to incorporate partial
analytical models to supplement missing statistics.
Mathematical description of complex geometric shapes – New research on the stochastic theory
of shapes to classify geometric shapes in terms of stochastic models, which are essential for the
rigorous comparisons needed for pattern discovery. We intend to develop high performance scalable
algorithms for querying, searching, tracking, and reconstruction of high dimensional shapes from
incomplete information.
Enumeration and optimization of multivariate functions on complex graphs – New research to
develop efficient methodologies for the hierarchical enumeration of composite objects, including
analytical methods for dynamically constraining the search space. We intend to develop optimization
methods to deal with novel spaces formed by graphs of identified features (dots) and their
relationships (connections). Such spaces typically have hundreds of variables and dimensions.
Additionally, we intend to develop computational libraries to efficiently handle an enormous number of
possible variants through construction of subgraph indexing schemes and efficient lookup methods.
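One simple way to picture a subgraph indexing scheme with fast lookup is an inverted index from labeled edges to the graphs that contain them. The sketch below is an assumption about what such a library might look like at its most basic, not the program's design. Graphs are given as lists of edges between feature labels, and the residue-style labels and variant names are hypothetical. The lookup is a filter: when labels repeat within a graph, the candidates it returns would still need an exact subgraph match.

```python
from collections import defaultdict

def edge_key(a, b):
    """Order-independent key for an undirected edge between two labeled features."""
    return (a, b) if a <= b else (b, a)

def build_index(graphs):
    """Map each edge key to the set of graph ids that contain that edge."""
    index = defaultdict(set)
    for gid, edges in graphs.items():
        for a, b in edges:
            index[edge_key(a, b)].add(gid)
    return index

def candidates(index, query_edges):
    """Graph ids containing every query edge: a filter, not a full subgraph match."""
    sets = [index.get(edge_key(a, b), set()) for a, b in query_edges]
    return set.intersection(*sets) if sets else set()

# Hypothetical variants, each a small graph of contacting residue types.
graphs = {
    "variant_a": [("ALA", "GLY"), ("GLY", "SER"), ("SER", "ALA")],
    "variant_b": [("ALA", "GLY"), ("GLY", "LYS")],
    "variant_c": [("GLY", "SER"), ("SER", "LYS")],
}
index = build_index(graphs)
matches = candidates(index, [("GLY", "ALA"), ("SER", "GLY")])
print(matches)
```

Each lookup touches only the sets for the query's own edges, so the cost depends on the query size and the number of matching graphs rather than on scanning every variant.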