Challenges in the Computational Discovery of Scientific Knowledge

Download Report

Transcript Challenges in the Computational Discovery of Scientific Knowledge

Challenges in the Computational
Discovery of Scientific Knowledge
Pat Langley
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/~langley
[email protected]
Thanks to K. Arrigo, S. Bay, L. Chrisman, D. George, A. Pohorille, C. Potter, J. Sanchez,
K. Saito, and J. Shrager.
The Challenge of Systems Science
Disciplines like Earth science and computational biology differ
from traditional fields in that they:
 focus on synthesis rather than analysis in their operation;
 rely on computer modeling as one of their central methods;
 develop system-level models with many variables and relations;
 evaluate their models on observational, not experimental, data.
Developing and testing such models are complex tasks that would
benefit from computational aids.
Our research goal is to design, construct, evaluate, and understand
such computational tools for systems science.
The Data Mining Paradigm
One approach to computational discovery, known as data mining:
 emphasizes the availability of vast amounts of data;
 focuses on business data, with some scientific applications;
 uses formalisms like decision trees, association rules, and
Bayesian networks to encode learned knowledge.
I.e., data mining researchers favor their own formalisms over those
used by scientists and engineers.
As a result, their discoveries are seldom communicable to members
of those communities.
Computational Scientific Discovery
An older paradigm, computational scientific discovery, instead:
 emphasizes use of heuristic search in the discovery process;
 focuses on discovery of knowledge in scientific domains;
 uses formalisms like numeric equations, structural models, and
reaction pathways to describe regularities.
I.e., researchers in this framework favor representations used by
scientists and engineers.
As a result, their systems’ discoveries are usually communicable
to members of those communities.
Successes of Computational Scientific Discovery
Over the past decade, systems of this type have helped discover
new knowledge in many scientific fields:
 qualitative chemical factors in mutagenesis (King et al., 1996)
 quantitative laws of metallic behavior (Sleeman et al., 1997)
 qualitative conjectures in number theory (Colton et al., 2000)
 temporal laws of ecological behavior (Todorovski et al., 2000)
 reaction pathways in catalytic chemistry (Valdes-Perez, 1994)
Each has led to publications in the refereed scientific literature
(e.g., Langley, 2000), but they did not focus on systems science.
Two Discovery Problems in Systems Science
Given
Find
Data on climate and
organism variables
over space and time
An ecosystem model
that fits these data
and explains them
Given
Find
Time-series data on
gene expressions for
specific organisms
A model of gene
regulation that fits and
explains these data
These problems raise new challenges that require advances in our
methods for computational discovery.
Challenge 1: Representing Scientific Models
To assist system scientists’ modeling efforts, we must first encode
candidate models that:
 address observational rather than experimental data;
 deal with dynamic systems that change over time;
 have an explanatory rather than a descriptive character;
 are causal in that they describe chains of effects;
 contain quantitative relations and qualitative structure.
We need some formal way to represent such models that can be
interpreted computationally.
Why Are Existing Formalisms Inadequate?
regression trees
B>6
C>0
14.3
C>4
18.7
11.5
16.9
hidden Markov models
0.7
x=16,x=2
y=13,x=1
Horn clause programs
1.0
x=12,x=1
y=18,x=2
x=19,x=1
y=11,x=2
0.3
x=12,x=1
y=10,x=2
1.0
gcd(X,X,X).
gcd(X,Y,D) :- X<Y,Z is Y–X,gcd(X,Z,D).
gcd(X,Y,D) :- Y<X,gcd(Y,X,D).
systems of equations
d[ice_mass,t] =  (18  heat) / 6.02
d[water_mass,t] = (18  heat) / 6.02
A Process Model for an Aquatic Ecosystem
model AquaticEcosystem
variables: phyto, zoo, nitro, residue
observables: phyto, nitro
process phyto_exponential_decay
equations: d[phyto,t,1] =  0.307  phyto
d[residue,t,1] = 0.307  phyto
process zoo_exponential_decay
equations: d[zoo,t,1] =  0.251  zoo
d[residue,t,1] = 0.251
process zoo_phyto_predation
equations: d[zoo,t,1] = 0.615  0.495  zoo
d[residue,t,1] = 0.385  0.495  zoo
d[phyto,t,1] =  0.495  zoo
process nitro_uptake
conditions: nitro > 0
equations: d[phyto,t,1] = 0.411  phyto
d[nitro,t,1] =  0.098  0.411  phyto
process nitro_remineralization;
equations: d[nitro,t,1] = 0.005  residue
d[residue,t,1 ] =  0.005  residue
Advantages of Quantitative Process Models
Process models are a good target for discovery systems because:
 they embed quantitative relations within qualitative structure;
 that refer to notations and mechanisms familiar to scientists;
 they provide dynamical predictions of changes over time;
 they offer causal and explanatory accounts of phenomena;
 while retaining the modularity needed to support induction.
Quantitative process models provide an important alternative to
formalisms used currently in computational discovery.
Challenge 2: Making Predictions from Models
To utilize or evaluate a given process model, we must simulate its
behavior over time:
 specify initial values for input variables and time step size;
 on each time step, determine which processes are active;
 solve active algebraic/differential equations with known values;
 propagate values and recursively solve other active equations;
 when multiple processes influence the same variable, assume
their effects are additive.
This performance method makes specific predictions that we can
compare to observations.
Predictions from the Ecosystem Model
Challenge 3: Encoding Background Knowledge
To constrain candidate models, we can utilize available backround
knowledge about the domain.
Previous work has cast background knowledge in terms of:
 Horn clause programs (e.g., Towell & Shavlik, 1990)
 context-free grammars (e.g., Dzeroski & Todorovski, 1997)
 prior probability distributions (e.g., Friedman et al., 2000)
However, none of these notations are familiar to domain scientists,
which suggests the need for another approach.
Generic Processes as Background Knowledge
Our framework casts background knowledge as generic processes
that specify:
 the variables involved in a process and their types;
 the parameters appearing in a process and their ranges;
 the forms of conditions on the process; and
 the forms of associated equations and their parameters.
Generic processes are building blocks from which one can compose
a specific process model.
Generic Processes for Aquatic Ecosystems
generic process exponential_decay
variables: S{species}, D{detritus}
parameters:  [0, 1]
equations: d[S,t,1] = 1    S
d[D,t,1] =   S
generic process remineralization
variables: N{nutrient}, D{detritus}
parameters:  [0, 1]
equations: d[N, t,1] =   D
d[D, t,1] = 1    D
generic process predation
variables: S1{species}, S2{species}, D{detritus}
parameters:  [0, 1],  [0, 1]
equations: d[S1,t,1] =     S1
d[D,t,1] = (1  )    S1
d[S2,t,1] = 1    S1
generic process constant_inflow
variables: N{nutrient}
parameters:  [0, 1]
equations: d[N,t,1] = 
generic process nutrient_uptake
variables: S{species}, N{nutrient}
parameters:  [0, ],  [0, 1],  [0, 1]
conditions: N > 
equations: d[S,t,1] =   S
d[N,t,1] = 1      S
Challenge 4: Inducing Process Models
training data
process model
model AquaticEcosystem
variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
observables: nitro, phyto, zoo
process phyto_exponential_growth
equations: d[phyto,t] = 0.1  phyto
process zoo_logistic_growth
equations: d[zoo,t] = 0.1  zoo / (1  zoo / 1.5)
Induction
process exponential_growth
variables: P {population}
equations: d[P,t] = [0, 1,]  P
process logistic_growth
variables: P {population}
equations: d[P,t] = [0, 1, ]  P  (1  P / [0, 1, ])
process constant_inflow
variables: I {inorganic_nutrient}
equations: d[I,t] = [0, 1, ]
process consumption
variables: P1 {population}, P2 {population}, nutrient_P2
equations: d[P1,t] = [0, 1, ]  P1  nutrient_P2,
d[P2,t] =  [0, 1, ]  P1  nutrient_P2
process no_saturation
variables: P {number}, nutrient_P {number}
equations: nutrient_P = P
process saturation
variables: P {number}, nutrient_P {number}
equations: nutrient_P = P / (P + [0, 1, ])
generic processes
process phyto_nitro_consumption
equations: d[nitro,t] = 1  phyto  nutrient_nitro,
d[phyto,t] = 1  phyto  nutrient_nitro
process phyto_nitro_no_saturation
equations: nutrient_nitro = nitro
process zoo_phyto_consumption
equations: d[phyto,t] = 1  zoo  nutrient_phyto,
d[zoo,t] = 1  zoo  nutrient_phyto
process zoo_phyto_saturation
equations: nutrient_phyto = phyto / (phyto + 0.5)
A Method for Process Model Induction
The IPM algorithm that constructs process models from generic
components in four stages:
1. Find all ways to instantiate known generic processes with
specific variables, subject to type constraints;
2. Combine instantiated processes into candidate generic models
that specify explanatory structures, with limits on the total
number of processes.
3. For each generic model, carry out gradient descent search
through parameter space to find good parameter values;
4. Return the parameterized model with the lowest description
length: Md = (Mv + Mc )  log (n) + n  log (Me ) .
Evaluation of the IPM Algorithm
To demonstrate IPM's ability to induce process models, we ran it
on synthetic data for a known system.
1. We used the aquatic ecosystem model to generate data on 500
time steps for the variables nitro and phyto;
2. We replaced each ‘true’ value x with x  (1 + r  0.05), where
r came from a Gaussian distribution ( = 0 and  = 1);
3. We ran IPM on these noisy data, giving it type constraints and
generic processes as background knowledge.
The IPM algorithm examined a space of 256 generic models,
each with an embedded parameter optimization.
Predictions from Induced Ecosystem Model
Issues in Process Model Induction
Inductive process modeling raises a number of issues that have
clear analogues in other paradigms:
 identifying conditions on processes (parameter optimization)
 inferring initial values of unobservables (parameter optimization)
 keeping the search space tractable (typing on variables)
 reducing variance to mitigate overfitting (min. desc. length)
We have demonstrated promising responses to these four problems
within the IPM framework.
Best Model Fit to Data from Ross Sea
Best Model Fit to Data on Protozoan Predation
Collecting Data on Photosynthetic Processes
www.affymetrix.com/
/wwwscience.murdoch.edu.au/teach
Microarray
Trace
Continuous Culture (Chemostat)
Health of Culture
External stimuli (e.g., light)
Adaptation Period
Sampling mRNA/cDNA
Equlibrium Period
www.affymetrix.com/
Time
Gene Expressions for Cyanobacteria
Generic Processes for Photosynthesis Regulation
generic process translation
variables: P{protein}, M{mRNA}
parameters:  [0, 1]
equations: d[P,t,1] =   M
generic process transcription
variables: M{mRNA}, R{rate}
parameters:
equations: d[M,t,1] = R
generic process regulate_one
variables: R{rate}, S{signal}
parameters:  [1 , 1]
equations: R =   S
generic process regulate_two
variables: R{rate}, S{signal}
parameters:  [1 , 1],  [0, 1]
equations: R =   S
d[S, t,1] = 1    S
generic process automatic_degradation
variables: C{concentration}
conditions: C > 0
parameters:  [0, 1]
equations: d[C,t,1] = 1    C
generic process controlled_degradation
variables: D{concentration}, E{concentration}
conditions: D > 0, E > 0
parameters:  [0, 1]
equations: d[D,t,1] = 1    E
d[E,t,1] = 1    E
generic process photosynthesis
variables: L{light}, P{protein}, R{redox}, S{ROS}
parameters:  [0, 1],  [0, 1]
equations: d[R,t,1] =   L  P
d[S,t,1] =   L  P
A Process Model for Photosynthetic Regulation
model photo_regulation
variables: light, mRNA_protein, ROS, redox, transcription_rate
observables: light, mRNA
process photosynthesis;
equations: d[redox,t,1] = 0.0155  light  protein
d[ROS,t,1] = 0.019  light  protein
process protein_translation
equations: d[protein,t,1] = 7.54  mRNA
process mRNA_transcription
equations: d[mRNA,t,1] = transcription_rate
process regulate_one_1
equations: transcription_rate = 0.99  light
process regulate_two_2
equations: transcription_rate = 1.203  redox
d[redox,t,1] =  0.0002  redox
process automatic_degradation_1
conditions: protein > 0
equations: d[protein,t,1] =  1.91  protein
process controlled_degradation_1
conditions:redox > 0, ROS > 0
equations: d[redox,t,1] =  0.0003  ROS
d[ROS,t,1] =  0.0003  ROS
Predictions from Best Parameterized Model
Electric Power on the International Space Station
Telemetry Data from Space Station Batteries
Induced Process Model for Battery Behavior
model Battery
variables: Rs, Vcb, soc , Vt, i, temperature
observable: soc, Vt, i, temperature
process voltage_charge
conditions: i  0
equations: Vt = Vcb + 6.105  Rs  i
process charge_transfer
equations: d[soc,t,1] = i  Vcb/179.38
process quadratic_influence_Vcb_soc
equations: Vcb = 41.32  soc  soc
process linear_influence_Vcb_temp
equations: Vcb = 0.2592  temperature
process linear_influence_Rs_soc
equations: Rs = 0.03894  soc
process voltage_discharge
conditions: i < 0
equations: Vt = Vcb  1.0 / (Rs + 1.0)
Results on Battery Test Data
Steps in Applying Computational Scientific Discovery
problem
formulation
algorithm
manipulation
algorithm
invocation
representation
engineering
data collection/
manipulation
filtering and
interpretation
Challenge 5: Interfacing with Scientists
Because scientists do not want to be replaced, we are developing
an interactive environment that lets users:
 specify a quantitative process model of the target system;
 display and edit the model’s structure and details graphically;
 simulate the model’s behavior over time and situations;
 compare the model’s predicted behavior to observations;
 invoke a revision module in response to detected anomalies.
The environment offers computational assistance in forming and
evaluating models but lets the user retain control.
Viewing and Editing a Process Model
Results of Revising the NPP Model
Initial model:
E = 0.56 · T1 · T2 · W
T2 = 1.18 / [(1 + e 0.2 · (Topt – Tempc – 10) ) · (1 + e 0.3 · (Tempc – Topt – 10) )]
PET = 1.6 · (10 · Tempc / AHI)A · PET-TW-M
SR  {3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05}
RMSE on training data = 465.212 and r 2 = 0.799
Revised model:
• E = 0.353 · T10.00 · T2 0.08 · W 0.00
• T2 = 0.83 / [(1 + e 1.0 · (Topt – Tempc – 6.34) ) · (1 + e 1.0 · (Tempc – Topt – 11.52) )]
PET = 1.6 · (10 · Tempc / AHI) A · PET-TW-M
• SR  {0.61, 3.99, 2.44, 10.0, 2.21, 2.13, 2.04, 0.43, 1.35, 1.85, 1.61}
Cross-validated RMSE = 397.306 and r 2 = 0.853 [ 15 % reduction ]
Intellectual Influences
Our approach to computational discovery incorporates ideas from
many traditions:
 computational scientific discovery (e.g., Langley et al., 1983);
 theory revision in machine learning (e.g., Towell, 1991);
 qualitative physics and simulation (e.g., Forbus, 1984);
 languages for scientific simulation (e.g., STELLA, MATLAB);
 interactive tools for data analysis (e.g., Schneiderman, 2001).
Our work combines, in novel ways, insights from machine learning,
AI, programming languages, and human-computer interaction.
Contributions of the Research
In summary, our work on computational scientific discovery has, in
responding to five challenges, produced:
 a new formalism for representing scientific process models;
 a computational method for simulating these models’ behavior;
 an encoding for background knowledge as generic processes;
 an algorithm for inducing process models from time-series data;
 an interactive environment for model construction/utilization.
We have demonstrated this approach to model construction on four
domains from Earth science, microbiology, and engineering.
Directions for Future Research
Despite our progress to date, we need further work in order to:
 produce additional results on other scientific data sets
 develop more robust methods for fitting model parameters
 extend the approach to handle data sets with missing values
 implement heuristic methods for searching the model space
 utilize knowledge of subsystems to further constrain search
 augment the modeling environment to make it more usable
Inductive process modeling has great potential to speed progress
in system science and engineering.
End of Presentation
In Memoriam
Two years ago, computational scientific discovery lost two of its
founding fathers:
 Herbert A. Simon (1916 – 2001)
 Jan M. Zytkow (1945 – 2001)
Both contributed to the field in many ways: posing new problems,
inventing methods, training students, and organizing meetings.
Moreover, both were interdisciplinary researchers who contributed
to computer science, psychology, philosophy, and statistics.
Herb Simon and Jan Zytkow were excellent role models that we
should all aim to emulate.
Data Mining vs. Scientific Discovery
There exist two computational paradigms for discovering explicit
knowledge from data:
 Data mining generates knowledge cast as decision trees,
logical rules, or other notations invented by AI researchers;
 Computational scientific discovery instead uses equations,
structural models, reaction pathways, or other formalisms
invented by scientists and engineers.
Both approaches draw on heuristic search to find regularities in
data, but they differ considerably in their emphases.
Time Line for Research on
Computational Scientific Discovery
1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
Abacus,
Coper
Bacon.1–Bacon.5
AM
Glauber
Dendral
Dalton,
Stahl
Numeric laws
Hume,
ARC
DST, GPN
LaGrange
IDSQ,
Live
NGlauber
Stahlp,
Revolver
IE
Legend
Fahrehneit, E*,
Tetrad, IDSN
Gell-Mann
BR-3,
Mendel
RL, Progol
Pauli
Coast, Phineas,
AbE, Kekada
Qualitative laws
SDS
HR
BR-4
Mechem, CDP
Structural models
SSF, RF5,
LaGramge
Process models
Astra,
GPM
Why Are Process Models Interesting?
Process models are a crucial target for machine learning because:
 they incorporate scientific formalisms rather than AI notations;
 that are easily communicable to scientists and engineers;
 they move beyond descriptive generalization to explanation;
 while retaining the modularity needed to support induction.
These reasons point to process models as an ideal representation
for scientific and engineering knowledge.
Process models are an important alternative to formalisms used
currently in machine learning.
Challenges of Inductive Process Modeling
Process model induction differs from typical learning tasks in that:
 process models characterize behavior of dynamical systems;
 variables are mainly continuous and data are unsupervised;
 observations are not independently and identically distributed;
 process models contain unobservable processes and variables;
 multiple processes can interact to produce complex behavior.
Compensating factors include a focus on deterministic systems and
the availability of background knowledge.
Making Predictions with Process Models
Specify initial values for input variables
and the size for time steps
On each time step, check conditions to
decide which processes are active
Solve algebraic and differential
equations with known values
Propagate values and recurse
to solve other equations
Add the effects of different
processes on each variable
Predictions from IPM’s Induced Model
Best Model Fit to Actual Nitrate Data
Best Model Fit to Actual Phytoplankton Data
Inductive Process Modeling
training data
Observed values for
a set of continuous
variables as they vary
over time or situations
learned model
Induction
Generic processes that
characterize causal
relationships among
variables in terms of
conditional equations
background knowledge
A specific process
model that explains
the observed values
and predicts future
data accurately
Inductive Process Modeling as Search
To construct a quantitative process model, we need an algorithm to
search the space of models that assumes:





an initial state from which to start search;
some operators that generate new states;
an evaluation function that selects among states;
an overall control regime for the search; and
a halting criterion for ending the search.
We have implemented a four-stage method that takes positions on
these design decisions.
The IPM Method for Process Model Induction
Find all ways to instantiate known generic
processes with specific variables
Combine subsets of instantiated
processes into generic models
Remove candidates that are too
complex or not connected graphs
For each generic model, search
for good parameter values
Return parameterized model
with the smallest error
A Biologist’s Depiction of Photosynthesis
http://www.bio.ic.ac.uk/research/barber/photosystemII.html
Predictions from Best Parameterized Model
The NPPc Portion of CASA
NPPc = Smonth max (E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt – 0.0005 · Topt2
T2 = 1.18 / [(1 + e 0.2 · (Topt – Tempc – 10) ) · (1 + e 0.3 · (Tempc – Topt – 10) )]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)A · PET-TW-M if Tempc > 0
PET = 0 if Tempc < 0
A = 0.00000068 · AHI3 – 0.000077 · AHI2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min [(SR-FAS – 1.08) / SR (UMD-VEG) , 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI – 1000)