Computational Discovery of Communicable Knowledge
Lessons for the Computational
Discovery of Scientific Knowledge
Pat Langley
Institute for the Study of Learning and Expertise
Palo Alto, California
and
Center for the Study of Language and Information
Stanford University, Stanford, California
http://www.isle.org/~langley
Thanks to S. Bay, V. Brooks, S. Klooster, A. Pohorille, C. Potter, K. Saito, J. Shrager,
M. Schwabacher, and A. Torregrosa.
Outline of the Talk
1. History of machine learning applications
2. Traditional lessons from applied machine learning
3. History of computational scientific discovery
4. Two application efforts in scientific discovery
5. Lessons from these application efforts
6. Directions for future research
History of Machine Learning Applications
Early 1980s: D. Michie et al. champion use of decision-tree
induction on industrial problems.
During 1980s: Parallel application developments in neural
networks and case-based learning.
Early 1990s: Initial reviews of machine learning applications.
Mid 1993: First workshops on applications of machine learning.
Mid 1995: CACM paper analyzes factors underlying success.
Mid 1995: KDD conference becomes the default meeting for
papers on machine learning applications.
Early 1998: Special issue of Machine Learning, with editorial,
on applications.
Steps in the Application of Machine Learning
Formulating the Problem
Engineering the Representation
Collecting and Preparing Data
Induction Process
Evaluating the Learned Knowledge
Gaining User Acceptance
Areas of Machine Learning Applications
There exist a number of application movements within the field of
machine learning:
data mining for classification/regression tasks
empirical natural language processing
applied reinforcement learning
adaptive interfaces for personalized services
computational scientific discovery
These types of applications differ in the demands they make and in
the issues they raise.
Data Mining vs. Scientific Discovery
There exist two computational paradigms for discovering explicit
knowledge from data:
Data mining generates knowledge cast as decision trees,
logical rules, or other notations invented by AI researchers;
Computational scientific discovery instead uses equations,
structural models, reaction pathways, or other formalisms
invented by scientists and engineers.
Both approaches draw on heuristic search to find regularities in
data, but they differ considerably in their emphases.
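As a toy illustration of heuristic search that returns an equation rather than a decision tree, the sketch below looks for small integer exponents (a, b) such that x^a · y^b stays nearly constant across observations, loosely in the spirit of the Bacon systems. The function name, the exponent range, the tolerance, and the preference for small exponents are assumptions made for this example, not details of any system named in the talk. The data are rounded planetary distances (in AU) and orbital periods (in years).

```python
from itertools import product

def find_invariant(xs, ys, max_exp=3, tol=0.01):
    """Search integer exponents (a, b) for which x**a * y**b is nearly
    constant over all observations. Returns (a, b, mean) or None.
    Ties are broken in favor of smaller |a| + |b| (an assumed heuristic)."""
    best = None
    for a, b in product(range(-max_exp, max_exp + 1), repeat=2):
        if a == 0 or b == 0:
            continue  # skip terms that ignore one of the variables
        terms = [x**a * y**b for x, y in zip(xs, ys)]
        mean = sum(terms) / len(terms)
        spread = (max(terms) - min(terms)) / abs(mean)
        if spread < tol:
            complexity = abs(a) + abs(b)
            if best is None or complexity < best[0]:
                best = (complexity, a, b, mean)
    return None if best is None else best[1:]

# Rounded values for Mercury, Venus, Earth, Mars, and Jupiter.
distance = [0.387, 0.723, 1.000, 1.524, 5.203]
period = [0.241, 0.615, 1.000, 1.881, 11.862]
result = find_invariant(distance, period)
print(result)
```

On these data the search settles on an exponent pair equivalent to Kepler's third law (period squared over distance cubed is roughly constant), which is a statement a scientist can read directly in the notation of the field.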
History of Research on Computational Scientific Discovery
[Figure: timeline from 1979 to 2000 of discovery systems, with a legend distinguishing numeric laws, qualitative laws, structural models, and process models. Systems shown: Abacus, Coper, Bacon.1–Bacon.5, AM, Glauber, Dendral, Dalton, Stahl, Legend, Hume, ARC, DST, GPN, LaGrange, IDSQ, Live, NGlauber, Stahlp, Revolver, IE, Fahrenheit, E*, Tetrad, IDSN, Gell-Mann, BR-3, Mendel, RL, Progol, Pauli, Coast, Phineas, AbE, Kekada, SDS, HR, BR-4, Mechem, CDP, SSF, RF5, LaGramge, Astra, GPM.]
Successes of Computational Scientific Discovery
Over the past decade, systems of this type have helped discover
new knowledge in many scientific fields:
• stellar taxonomies from infrared spectra (Cheeseman et al., 1989)
• qualitative chemical factors in mutagenesis (King et al., 1996)
• quantitative laws of metallic behavior (Sleeman et al., 1997)
• qualitative conjectures in number theory (Colton et al., 2000)
• temporal laws of ecological behavior (Todorovski et al., 2000)
• reaction pathways in catalytic chemistry (Valdes-Perez, 1994, 1997)
Each of these has led to publications in the refereed literature of
the relevant scientific field (see Langley, 2000).
Steps in Applying Computational Scientific Discovery
problem formulation
algorithm manipulation
algorithm invocation
representation engineering
data collection/manipulation
filtering and interpretation
Two Applications for Scientific Discovery
Given: Data on climate variables and carbon production over space and time.
Find: A model of the Earth’s ecosystem that fits and explains these data.
Given: Gene expression levels, over time, for wild and mutant organisms.
Find: A model of gene regulation that fits and explains these data.
Lesson 1
Traditional notations from machine learning are not communicated
easily to domain scientists.
Ecosystem model
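$\text{NPPc} = \sum_{\text{month}} \max(E \cdot \text{IPAR},\ 0)$
$E = 0.56 \cdot \text{T1} \cdot \text{T2} \cdot W$
$\text{T1} = 0.8 + 0.02 \cdot \text{Topt} - 0.0005 \cdot \text{Topt}^2$
$\text{T2} = 1.18 / [(1 + e^{0.2 (\text{Topt} - \text{Tempc} - 10)}) \cdot (1 + e^{0.3 (\text{Tempc} - \text{Topt} - 10)})]$
$W = 0.5 + 0.5 \cdot \text{EET} / \text{PET}$
$\text{PET} = 1.6 \cdot (10 \cdot \text{Tempc} / \text{AHI})^{A} \cdot \text{PET-TW-M}$ if $\text{Tempc} > 0$
$\text{PET} = 0$ if $\text{Tempc} < 0$
$A = 0.00000068 \cdot \text{AHI}^3 - 0.000077 \cdot \text{AHI}^2 + 0.018 \cdot \text{AHI} + 0.49$
$\text{IPAR} = 0.5 \cdot \text{FPAR-FAS} \cdot \text{Monthly-Solar} \cdot \text{Sol-Conver}$
$\text{FPAR-FAS} = \min[(\text{SR-FAS} - 1.08) / \text{SR}(\text{UMD-VEG}),\ 0.95]$
$\text{SR-FAS} = (\text{Mon-FAS-NDVI} + 1000) / (\text{Mon-FAS-NDVI} - 1000)$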
Gene regulation model
[Figure: qualitative causal network over Light, NBLR, NBLA, PBS, DFR, RR, psbA1, psbA2, cpcB, Photo, and Health, with each link labeled + or -.]
Lesson 2
Scientists often have initial models that should influence the
discovery process.
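[Figure: an Initial model and Observations are passed to Discovery, which produces a Revised model. The initial and revised models are gene regulation networks over NBLR, NBLA, psbA1, psbA2, RR, cpcB, Light, PBS, DFR, Photo, and Health, with links labeled + or -; some links in the revised model are marked ×.]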
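A minimal sketch of how an initial model can steer discovery: rather than searching over all networks, the code below starts from a given set of signed links and considers only single sign flips, keeping the change that best matches the observed directions of change. This is not the revision method used in the work described here; the function names, the restriction to sign flips, and the tiny network and observations (`signal`, `g1`, `g2`, `g3`) are all invented for illustration.

```python
def predict(links, source):
    """Propagate a qualitative increase in `source` through signed links.
    `links` maps (cause, effect) -> +1 or -1. The graph is assumed to be
    acyclic with at most one path from `source` to each node."""
    effect = {source: +1}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        for (cause, target), sign in links.items():
            if cause == node and target not in effect:
                effect[target] = effect[node] * sign
                frontier.append(target)
    return effect

def score(links, source, observed):
    """Count observed variables whose direction of change is predicted."""
    predicted = predict(links, source)
    return sum(predicted.get(var) == direction
               for var, direction in observed.items())

def revise(links, source, observed):
    """Try each single sign flip of the initial model; keep the best one."""
    best, best_score = dict(links), score(links, source, observed)
    for edge in links:
        candidate = dict(links)
        candidate[edge] = -candidate[edge]
        candidate_score = score(candidate, source, observed)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Hypothetical initial model and observations (not taken from the slides).
initial = {("signal", "g1"): +1, ("g1", "g2"): +1, ("g1", "g3"): -1}
observed = {"g1": +1, "g2": -1, "g3": -1}
revised, matched = revise(initial, "signal", observed)
print(revised, matched)
```

Because the search begins at the scientist's model and makes local changes, the output stays close to something the scientist already recognizes, which is the point of the lesson.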
Lesson 3
Scientific data are often rare and difficult to obtain rather than
being plentiful.
Ecosystem model
Number of variables: 8
Number of equations: 11
Number of parameters: 20
Number of samples: 303
Gene regulation model
Number of variables: 9
Number of initial links: 11
Number of possible links: 70
Number of samples: 20
Lesson 4
Scientists want models that move beyond description to provide
explanations of their data.
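Ecosystem model
[Figure: dependency graph over NPPc, E, e_max, W, T1, T2, IPAR, SOLAR, FPAR, A, PET, EET, Topt, SR, PETTWM, Tempc, NDVI, VEG, and AHI.]
Gene regulation model
[Figure: qualitative causal network over Light, NBLR, NBLA, PBS, DFR, RR, psbA1, psbA2, cpcB, Photo, and Health, with links labeled + or -.]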
Lesson 5
Scientists want computational assistance rather than automated
discovery systems.
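[Figure: an Initial model and Observations are passed to Discovery, which produces a Revised model. The initial and revised models are gene regulation networks over NBLR, NBLA, psbA1, psbA2, RR, cpcB, Light, PBS, DFR, Photo, and Health, with links labeled + or -; some links in the revised model are marked ×.]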
An Environment for Interactive Modeling
In response, we are developing an environment that lets users:
specify process models of static and dynamic systems;
display and edit a model’s structure and details graphically;
utilize a model to simulate a system’s behavior over time;
incorporate background knowledge cast as generic processes;
indicate which processes to consider during model revision;
invoke a revision module that improves a model’s fit to data.
The current environment focuses on quantitative processes, but
future versions will also support qualitative models.
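To make the idea of a revision step concrete, here is a minimal sketch that revises one multiplicative constant so that a model fits data better, using the closed-form least-squares estimate. It is only an illustration under stated assumptions: the function names are hypothetical, the data are synthetic, and nothing here is claimed to be how the environment's revision module works.

```python
def refit_scale(predict_unit, inputs, targets):
    """Least-squares estimate of c in  target ~ c * predict_unit(x)."""
    basis = [predict_unit(x) for x in inputs]
    numerator = sum(b * t for b, t in zip(basis, targets))
    denominator = sum(b * b for b in basis)
    if denominator == 0:
        raise ValueError("basis is identically zero; c cannot be estimated")
    return numerator / denominator

def sse(c, predict_unit, inputs, targets):
    """Sum of squared errors of the model c * predict_unit(x)."""
    return sum((t - c * predict_unit(x)) ** 2
               for x, t in zip(inputs, targets))

def stress_product(x):
    """Unit-scale part of a model of the form E = c * T1 * T2 * W."""
    t1, t2, w = x
    return t1 * t2 * w

# Synthetic inputs (T1, T2, W) and targets generated with c = 0.4.
inputs = [(1.0, 0.9, 0.8), (0.95, 1.0, 0.6), (0.9, 0.7, 1.0), (1.0, 1.0, 0.5)]
targets = [0.4 * stress_product(x) for x in inputs]

c_initial = 0.56
c_revised = refit_scale(stress_product, inputs, targets)
print(c_revised, sse(c_initial, stress_product, inputs, targets))
```

Restricting revision to parameters the user has marked as revisable keeps the structure of the model intact and limits the search to a form the user has already accepted.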
A Process Model for Carbon Production
model npp;
variables NPPc, E, IPAR, T1, T2, W, Topt, tempc, eet, PET, PETTWM,
ahi, A, FPARFAS, monthlySolar, SolConver, MONFASNDVI, umd_veg;
observable ahi,eet,tempc,Topt,MONFASNDVI,monthlySolar,PETTWM,umd_veg;
process CarbonProd;
equations NPPc = E * IPAR;
process PhotoEfficiency;
equations E = (0.389 * (T1 * (T2 * W)));
process TempStress1;
equations T1 = (0.8 + ((0.02 * Topt) - (0.0005 * (Topt ^ 2))));
process TempStress2;
equations T2 = ((1.1814 /
(1 + (2.718281828 ^ (0.2 * (Topt - 10 - tempc))))) /
(1 + (2.718281828 ^ (0.3 * (tempc - 10 - Topt)))));
process WaterStress;
conditions PET!=0;
equations W = (0.5 + (0.5 * (eet / PET)));
process WSNoEvapoTrans;
conditions PET==0;
equations W = 0.5;
process EvapoTrans;
conditions tempc>0;
equations PET = 1.6 * (10 * tempc / ahi) ^ A * PETTWM;
...
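For readers who want to trace the arithmetic, the processes shown in the fragment above can be mirrored in a few lines of Python. This is a sketch, not the environment's simulator. Several things are assumptions: the function name; treating PET as 0 when tempc is not positive (the fragment shows only the tempc>0 case); computing A from the polynomial in AHI given in the CASA equations; requiring ahi > 0; and passing IPAR in directly, since its process is not part of the fragment shown. The input values in the usage lines are made up.

```python
import math

def evaluate_npp(topt, tempc, eet, pettwm, ahi, ipar):
    """Evaluate the listed processes for one site and one month.
    Conditions on each process select which equation applies."""
    a = 0.00000068 * ahi**3 - 0.000077 * ahi**2 + 0.018 * ahi + 0.49
    # EvapoTrans applies only when tempc > 0; otherwise PET is assumed 0.
    pet = 1.6 * (10 * tempc / ahi) ** a * pettwm if tempc > 0 else 0.0
    # WaterStress when PET != 0, WSNoEvapoTrans when PET == 0.
    w = 0.5 + 0.5 * (eet / pet) if pet != 0 else 0.5
    # TempStress1 and TempStress2.
    t1 = 0.8 + 0.02 * topt - 0.0005 * topt**2
    t2 = (1.1814
          / (1 + math.exp(0.2 * (topt - 10 - tempc)))
          / (1 + math.exp(0.3 * (tempc - 10 - topt))))
    # PhotoEfficiency and CarbonProd.
    e = 0.389 * t1 * t2 * w
    return {"A": a, "PET": pet, "W": w, "T1": t1, "T2": t2,
            "E": e, "NPPc": e * ipar}

# Made-up inputs: one warm month and one cold month.
warm = evaluate_npp(topt=20, tempc=20, eet=5, pettwm=1, ahi=50, ipar=100)
cold = evaluate_npp(topt=20, tempc=-5, eet=5, pettwm=1, ahi=50, ipar=100)
print(warm["NPPc"], cold["W"])
```

The sketch uses `math.exp` where the model writes 2.718281828 raised to a power; the two agree to the precision of that constant.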
Viewing and Editing a Process Model
Directions for Future Research
These lessons suggest the field needs increased research on:
methods for discovering knowledge in scientific formalisms
techniques for revising existing scientific models
approaches to dealing with small data sets
algorithms for discovering explanatory models
interactive environments for scientific knowledge discovery
Taken together, these emphases should address the needs of domain
scientists and produce interesting new methods.
In Memoriam
Early last year, computational scientific discovery lost two of its
founding fathers:
Herbert A. Simon (1916 – 2001)
Jan M. Zytkow (1945 – 2001)
Both contributed to the field in many ways: posing new problems,
inventing methods, training students, and organizing meetings.
Moreover, both were interdisciplinary researchers who contributed
to computer science, psychology, philosophy, and statistics.
Herb Simon and Jan Zytkow were excellent role models that we
should all aim to emulate.
The NPPc Portion of CASA
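$\text{NPPc} = \sum_{\text{month}} \max(E \cdot \text{IPAR},\ 0)$
$E = 0.56 \cdot \text{T1} \cdot \text{T2} \cdot W$
$\text{T1} = 0.8 + 0.02 \cdot \text{Topt} - 0.0005 \cdot \text{Topt}^2$
$\text{T2} = 1.18 / [(1 + e^{0.2 (\text{Topt} - \text{Tempc} - 10)}) \cdot (1 + e^{0.3 (\text{Tempc} - \text{Topt} - 10)})]$
$W = 0.5 + 0.5 \cdot \text{EET} / \text{PET}$
$\text{PET} = 1.6 \cdot (10 \cdot \text{Tempc} / \text{AHI})^{A} \cdot \text{PET-TW-M}$ if $\text{Tempc} > 0$
$\text{PET} = 0$ if $\text{Tempc} < 0$
$A = 0.00000068 \cdot \text{AHI}^3 - 0.000077 \cdot \text{AHI}^2 + 0.018 \cdot \text{AHI} + 0.49$
$\text{IPAR} = 0.5 \cdot \text{FPAR-FAS} \cdot \text{Monthly-Solar} \cdot \text{Sol-Conver}$
$\text{FPAR-FAS} = \min[(\text{SR-FAS} - 1.08) / \text{SR}(\text{UMD-VEG}),\ 0.95]$
$\text{SR-FAS} = (\text{Mon-FAS-NDVI} + 1000) / (\text{Mon-FAS-NDVI} - 1000)$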
The NPPc Portion of CASA
[Figure: dependency graph over NPPc, E, e_max, W, A, PET, AHI, PETTWM, IPAR, T2, EET, Tempc, T1, SOLAR, Topt, SR, NDVI, FPAR, and VEG.]
A Model of Photosynthesis Regulation
How do plants modify their photosynthetic apparatus in high light?
[Figure: qualitative causal network over Light, NBLR, NBLA, PBS, DFR, RR, psbA1, psbA2, cpcB, Photo, and Health, with each link labeled + or -.]