Discovery Systems: Accelerating Scientific Discovery at NASA

Barney Pell, Ph.D.
NASA Ames Research Center
Barney.D.Pell@nasa.gov
Presentation at IAAI-04 panel on The Broader Role of
Artificial Intelligence in Large-Scale Scientific Research
Outline of Talk
• Trends and Challenges affecting Scientific Discovery
at NASA
• Distributed Data Search, Access, and Analysis
• Machine-Assisted Model Discovery and Refinement
• Exploratory Environments and Collaboration
• Vision for the future and summary of AI technologies
• Closing remarks
Science Discovery Acceleration
• NASA conducts missions to take measurements that produce large
amounts of data to support ambitious science goals
– In-situ observation of deep space for origin and evolution of life
– Earth-orbiting satellites for global cause and effect relationships
– Biological experiments to support life in space
• Too much work and expertise required to perform each of many
steps in a discovery cycle to understand this data
– Detailed knowledge of the heritage of data and models
– Hard to invert through a complex processing pipeline
– Constant reprocessing and reanalyzing as new information becomes available
• The specialized expertise slows the process and also
restricts the set of users and scientists using NASA
products
Discovery Steps and Architectures
• Examples of discovery steps
– finding and organizing distributed data
– assessing, filtering, cleaning and post-processing the data
– reconciling the differences across diverse data
– exploring the data sets to discover regularities
– using the regularities to formulate and evaluate hypotheses
– testing the hypotheses and comparing alternate hypotheses against each other
– integrating the data into models
– linking separate models together
– running simulations to generate predictive data to compare against observations
• Current technology programs address the difficulties of individual
steps, typically in isolation
– E.g., machine-learning algorithms detect regularities in the underlying
phenomena but also artifacts of the data collection/processing system.
• ML algorithms developed without consideration of the deeper processes by
which the data is generated, distributed, and used
• Data system put together without characterizing the data stream to enable
new users to analyze the data in unanticipated ways.
Trends affecting NASA
• Improvements in sensors, communications, and computing
– orders of magnitude more data, in more varieties, and at higher rates
than ever before.
• NASA’s science questions are becoming increasingly large-scale
and interdisciplinary.
– forming and evaluating theories across a wide variety of data
– integrating a complex set of models produced by diverse communities
of scientists
– virtual projects comprising distributed teams
• Socioeconomic demands are requiring increased quality
– E.g., many customers for weather and climate model predictions
– Need characterization of confidence in data, models, results
• Faster feedback loops in observing/simulation systems
– make it possible to gather more precise data, often in real-time, if only
we could understand the existing data quickly enough.
• NASA is required to enable public access to, and benefit from, the data to
the same extent as the mission science team
Distributed Search, Access and
Analysis
• Objective
– Develop and demonstrate technologies to enable investigating
interdisciplinary science questions by finding, integrating, and
composing models and data from distributed archives, pipelines,
running simulations, and running instruments.
– Support interactive and complex query-formulation with
constraints and goals in the queries; and resource-efficient
intelligent execution of these tasks in a resource-constrained
environment.
– Milestone: Enable novel what-if and predictive question
answering
• Across NASA’s complex and heterogeneous data and simulations
• By non data-specialists
• Use world-knowledge and meta-data
• Support query formulation and resource discovery
• Example query: “Within 20%, what will be the water runoff in the
creeks of the Comanche National Grassland if we seed the clouds
over southern Colorado in July and August next year?”
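A query with constraints and goals, such as the example query, can be pictured as a small structured record. The sketch below is illustrative only: the class `WhatIfQuery`, its field names, and the helper `meets_tolerance` are hypothetical and not part of any NASA system, the ensemble values are made up, and treating "within 20%" as a bound on ensemble spread is one crude reading among several.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class WhatIfQuery:
    """Hypothetical structured form of a what-if question."""
    quantity: str       # what to predict
    region: str         # where
    intervention: str   # the hypothetical action
    period: str         # when the intervention happens
    tolerance: float    # acceptable relative error, e.g. 0.20 for "within 20%"


def meets_tolerance(query, ensemble):
    """Return (estimate, ok).

    ok is True when every ensemble member lies within the query's
    relative tolerance of the ensemble mean.
    """
    estimate = mean(ensemble)
    worst = max(abs(x - estimate) for x in ensemble)
    return estimate, worst <= query.tolerance * abs(estimate)


query = WhatIfQuery(
    quantity="water runoff",
    region="creeks of the Comanche National Grassland",
    intervention="seed the clouds over southern Colorado",
    period="July and August next year",
    tolerance=0.20,
)

# Made-up ensembles of simulated runoff values (arbitrary units).
tight = [95.0, 100.0, 105.0]
loose = [60.0, 100.0, 140.0]

print(meets_tolerance(query, tight))
print(meets_tolerance(query, loose))
```

Keeping the tolerance inside the query lets an execution engine decide how much simulation effort the answer needs, which is the sense in which constraints and goals travel with the query.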
Terrestrial Biogeoscience Involves Many Complex Processes and Data
[Figure: coupled processes grouped by time scale (minutes-to-hours, days-to-weeks, years-to-centuries). Panels: Chemistry (CO2, CH4, N2O, ozone, aerosols); Climate (temperature, precipitation, radiation, humidity, wind); Biogeophysics (aerodynamics, energy, water, microclimate, canopy physiology); Biogeochemistry (carbon assimilation, decomposition, mineralization); Hydrology (evaporation, transpiration, snow melt, infiltration, runoff, intercepted water, snow, soil water); Hydrologic Cycle; Watersheds (surface water, subsurface water, geomorphology); Ecosystems (species composition, ecosystem structure); Vegetation Dynamics; Phenology (bud break, leaf senescence); Disturbance (fires, hurricanes, ice storms, windthrows). Exchanged quantities include heat, moisture, momentum, CO2, CH4, N2O, VOCs, and dust.]
(Courtesy Tim Killeen and Gordon Bonan, NCAR)
Solution Construction via Composing Models
[Figure: models and data sources composed into a solution. Models shown: evaporation model, runoff model, snow melt, climate model. Data sources shown: rainfall (Nat. Weather Service), topography (USGS), snow coverage (snow and ice DAAC, NASA). Other labels: modeled phenomenon, parameterized phenomenon, metadata, data preparation, binary data streams, surface water community.]
• Service interface: required inputs, provided outputs, data descriptions, events
• Each model typically has a community of experts that deal with the complexity of the model and its environment
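The idea of composing models through a service interface with required inputs and provided outputs can be sketched as a simple chaining planner. This is a minimal sketch under stated assumptions, not the system described in the talk: the class `ModelService`, the function `plan`, and the specific input/output wiring between the models are all hypothetical; only the model and data-source names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelService:
    """Hypothetical service interface for one model."""
    name: str
    requires: frozenset  # required inputs
    provides: frozenset  # provided outputs


def plan(goal, available, services):
    """Return an ordered list of service names that yields `goal`.

    Starts from the `available` data and repeatedly runs the first
    service whose required inputs are all present. Returns None when
    no chain of services can produce the goal.
    """
    have = set(available)
    remaining = list(services)
    order = []
    while goal not in have:
        runnable = [s for s in remaining if s.requires <= have]
        if not runnable:
            return None
        svc = runnable[0]
        have |= svc.provides
        order.append(svc.name)
        remaining.remove(svc)
    return order


# Assumed wiring, for illustration only.
services = [
    ModelService("runoff model",
                 frozenset({"rainfall", "topography", "snow melt", "evaporation"}),
                 frozenset({"runoff"})),
    ModelService("snow melt model",
                 frozenset({"snow coverage"}),
                 frozenset({"snow melt"})),
    ModelService("evaporation model",
                 frozenset({"rainfall", "topography"}),
                 frozenset({"evaporation"})),
]

available = {"rainfall", "topography", "snow coverage"}
print(plan("runoff", available, services))
```

The planner is deliberately naive: it may run a service that the goal does not need, and it ignores cost, data descriptions, and events. It only shows how declared inputs and outputs make automated solution construction possible.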
Virtual Data Grid Example
Application: three data types of interest, forming a derivation chain. The third is derived from the second, the second is derived from the first, and the first is primary data.
(Interactions and operations proceed left to right.)
[Figure: request flow among the Metadata Catalogue, the Materialized Data Catalogue, the Virtual Data Catalogue (how to generate the derived data), the Abstract Planner (for materializing data), the Concrete Planner (generates workflow), the Grid workflow engine, Grid compute resources, Grid storage resources, and Data Grid replica services. Messages shown include: requests for needed data; a reply that the data is known, with a referral to the Materialized Data Catalogue; how to generate the data and an estimate for generating it; a prompt to proceed; a request to materialize the data with its PERS; the exact steps to generate it; resolving an LFN to a PFN; and notification that the data exists or is materialized at an LFN.]
• Store an archival copy, if so requested.
• Record existence of cached copies.
• LFN = logical file name
• PFN = physical file name
• PERS = prescription for generating unmaterialized data
• As illustrated, it is easy to deadlock without QoS and SLAs.
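The core of the virtual data idea, resolving a logical file name (LFN) to a physical file name (PFN) and generating the data from its prescription (PERS) when no copy exists, can be sketched in a few lines. This is an assumption-laden illustration rather than the architecture on the slide: the function `materialize`, the dictionary-based catalogues, the LFNs `primary`, `intermediate`, and `final`, and the stub `run` are all hypothetical, and planning, scheduling, QoS, and SLAs are left out entirely.

```python
def materialize(lfn, recipes, materialized, run):
    """Return the PFN for `lfn`, generating the data first if needed.

    recipes:      LFN -> (transformation name, list of input LFNs);
                  stands in for a virtual data catalogue of prescriptions.
    materialized: LFN -> PFN; stands in for a materialized data catalogue.
    run:          callable(transformation, input_pfns, output_lfn) -> PFN.
    """
    if lfn in materialized:
        return materialized[lfn]
    if lfn not in recipes:
        raise KeyError(f"no copy and no prescription for {lfn!r}")
    transformation, inputs = recipes[lfn]
    # Inputs that are themselves unmaterialized are generated recursively.
    input_pfns = [materialize(i, recipes, materialized, run) for i in inputs]
    pfn = run(transformation, input_pfns, lfn)
    materialized[lfn] = pfn  # record the existence of the new copy
    return pfn


# Hypothetical three-step derivation chain.
recipes = {
    "intermediate": ("derive_intermediate", ["primary"]),
    "final": ("derive_final", ["intermediate"]),
}
materialized = {"primary": "/archive/primary"}
log = []


def run(transformation, input_pfns, output_lfn):
    """Stub executor: records the call and invents a cache path."""
    log.append((transformation, input_pfns, output_lfn))
    return f"/cache/{output_lfn}"


print(materialize("final", recipes, materialized, run))
```

A second request for the same LFN returns the recorded PFN without rerunning any transformation, which is the benefit of recording cached copies.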
Machine assisted model discovery
and refinement
• Develop and demonstrate methods to
– assist discovery of and fit physically descriptive models with
quantifiable uncertainty for estimation and prediction
– improve the use of observational or experimental data for
simulation and assimilation applied to distributed instrument
systems (e.g. sensor web)
– integrate instrument models with physical domain modeling and
with other instruments (fusion) to quantify error, correct for
noise, improve estimates and instrument performance.
• E.g., metrics
– 50% reduction in scientist time forming models
– 10% reduction in uncertainty in parameter estimates or a 10%
reduction in effort to achieve current accuracies
– 10% reduction in computational costs associated with a forward
model
– ability to process data on the order of 1000s of dimensions
– ability to estimate parameters from tera-scale data.
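The fusion objective above, combining instruments to quantify error and improve estimates, has a standard textbook instance: the inverse-variance weighted mean. The sketch below is a generic illustration, not a method attributed to the talk. It assumes independent, unbiased measurements of the same quantity with known variances; the function name `fuse` and the sample numbers are made up.

```python
def fuse(measurements):
    """Inverse-variance weighted mean of independent measurements.

    measurements: list of (value, variance) pairs, each variance > 0.
    Returns (fused value, fused variance).
    """
    weights = [1.0 / var for _, var in measurements]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, measurements)) / total
    return value, 1.0 / total


# Two hypothetical instruments observing the same quantity.
value, variance = fuse([(10.0, 4.0), (12.0, 4.0)])
print(value, variance)
```

Under these assumptions the fused variance is never larger than the smallest input variance, so each added instrument can only tighten the estimate. Real instruments with correlated or biased errors need an explicit instrument model, which is the harder problem the objectives describe.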
Prediction of the 97/98 El Nino
[Figure: predicted precipitation, JFM 1998; time axis spans 1997 to 1999]
A reasonable 15-month prediction of the 97/98 El Nino is
achieved when ocean height, temperature and surface wind
data are combined to initialize the model.
Observing System of the Future
• Partners: NASA, DoD, Other Govt, Commercial, International
• Advanced Sensors
• Information Synthesis
• Access to Knowledge
• Sensor Web
• Information User Community
Exploratory Environments and
Collaboration
• Objective
– Develop exploratory environments in which
interdisciplinary and/or distributed teams visualize
and interact with intelligently combined and
presented data from such sources as distributed
archives, pipelines, simulations, and instruments in
networked environments.
– Demonstrate that these environments measurably
improve scientists’ capability to answer questions,
evaluate models, and formulate follow-on questions
and predictions.
Multi-parameter Explorations
Vision for future science
Technical area, today versus tomorrow:
• Distributed Data Search, Access, and Analysis
– Today: Answering queries requires specialized knowledge of content, location, and configuration of all relevant data and model resources. Solution construction is manual.
– Tomorrow: Search queries are based on high-level requirements. Solution construction is mostly automated and accessible to users who aren’t specialists in all elements.
• Machine integration of data / QA
– Today: Publishing a new resource takes 1-3 years. Assembling a consistent heterogeneous dataset takes 1-3 years. Automated data quality assessment is by limits and rules.
– Tomorrow: Publishing a new resource takes 1 week. A consistent heterogeneous dataset is assembled in real time. Automated data quality assessment is by world models and cross-validation.
• Machine-Assisted Model Discovery and Refinement
– Today: Physical models have hidden assumptions and legacy restrictions. Machine learning algorithms are separate from simulations, instrument models, and data manipulation codes.
– Tomorrow: Prediction and estimation systems integrate models of the data collection instruments, simulation models, and observational data formatting and conditioning capabilities. Predictions and estimates come with known certainties.
• Exploratory Environments and Collaboration
– Today: Co-located interdisciplinary teams jointly visualize multi-dimensional preprocessed data or ensembles of running simulations on wall-sized matrixed displays.
– Tomorrow: Distributed teams visualize and interact with intelligently combined and presented data from such sources as distributed archives, pipelines, simulations, and instruments in networked environments.
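The contrast between data quality assessment "by limits and rules" and assessment that cross-checks data against other sources can be made concrete with a toy example. This is a hedged sketch only: the function names `check_limits` and `check_against_reference`, the thresholds, and the readings are invented, and comparing against an independent reference series is just one simple reading of cross-validation in this context.

```python
def check_limits(readings, low, high):
    """Return indices of readings outside the allowed [low, high] range."""
    return [i for i, x in enumerate(readings) if not (low <= x <= high)]


def check_against_reference(readings, reference, max_diff):
    """Return indices where a reading disagrees with an independent
    reference value by more than max_diff."""
    return [
        i for i, (x, r) in enumerate(zip(readings, reference))
        if abs(x - r) > max_diff
    ]


# Made-up temperature readings and an independent reference series.
readings = [12.1, 11.8, 85.0, 12.4, 19.9]
reference = [12.0, 12.0, 12.2, 12.3, 12.5]

print(check_limits(readings, low=-50.0, high=60.0))
print(check_against_reference(readings, reference, max_diff=2.0))
```

The limit check catches only the physically implausible value, while the comparison against a second source also flags a reading that is plausible in isolation but inconsistent with its neighbors in the other data set.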
Discovery Systems: AI Technology Elements
– Distributed data search, access and analysis
• Grid based computing and services
• Information retrieval
• Databases
• Planning, execution, agent architecture, multi-agent systems
• Knowledge representation and ontologies
– Machine-assisted model discovery and refinement
• Information and data fusion
• Data mining and Machine learning
• Modeling and simulation languages
– Exploratory environments and Collaboration
• Visualization
• Human-computer interaction
• Computer-supported collaborative work
• Cognitive models of science
Closing remarks
• NASA science is challenging
• Need to improve existing capabilities and address
emerging trends
• AI technologies have a crucial role for future science
– Distributed Data Search, Access, and Analysis
– Machine-Assisted Model Discovery and Refinement
– Exploratory Environments and Collaboration
• Many of these themes are shared with science (or
research) at large