Computational Discovery of Communicable Knowledge

Challenges for the Computational
Discovery of Scientific Knowledge
Pat Langley
School of Computing and Informatics
Arizona State University
Tempe, Arizona
Institute for the Study of Learning and Expertise
Palo Alto, California
Thanks to K. Arrigo, D. Billman, M. Bravo, S. Borrett, W. Bridewell, S. Dzeroski, and
L. Todorovski for their contributions to this research, which is funded by a grant from
the National Science Foundation.
Drawbacks of Scientific Data Mining
Because it borrows from work on commercial applications, most
work on scientific data mining:
• generates models in forms inappropriate to most sciences
• makes incorrect assumptions about the available inputs
• focuses on convenient algorithmic issues, not scientists’ needs
We need to redirect attention toward a broader range of discovery
tasks that actually arise in scientific fields.
Data-mining researchers would benefit from looking at the older
literature on computational scientific discovery.
Claim 1: Scientific Notations
Traditional data-mining notations are not easily understood by or
communicated to domain scientists.
Most sciences state and communicate models in formalisms they
have used for decades.
We need more work on discovering scientific knowledge cast in
communicable forms (Dzeroski & Todorovski, 2007).
Ecosystem model
NPPc = Σ_month max(E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt − 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^(0.2 · (Topt − Tempc − 10))) · (1 + e^(0.3 · (Tempc − Topt − 10)))]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M if Tempc > 0
PET = 0 if Tempc < 0
A = 0.00000068 · AHI^3 − 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min[(SR-FAS − 1.08) / SR(UMD-VEG), 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI − 1000)
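To show how the ecosystem equations above chain together, here is a minimal Python sketch that transcribes them literally. The function names, the dictionary keys, and the handling of two cases the slide leaves open (Tempc exactly 0, and W when PET is 0) are assumptions made for illustration, not part of the original model description. SR(UMD-VEG) is treated as a plain number supplied by the caller.

```python
import math


def t1(topt):
    # T1 = 0.8 + 0.02 * Topt - 0.0005 * Topt^2
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2


def t2(topt, tempc):
    # T2 = 1.18 / [(1 + e^(0.2 (Topt - Tempc - 10))) (1 + e^(0.3 (Tempc - Topt - 10)))]
    low = 1 + math.exp(0.2 * (topt - tempc - 10))
    high = 1 + math.exp(0.3 * (tempc - topt - 10))
    return 1.18 / (low * high)


def a_exponent(ahi):
    # A = 0.00000068 * AHI^3 - 0.000077 * AHI^2 + 0.018 * AHI + 0.49
    return 0.00000068 * ahi ** 3 - 0.000077 * ahi ** 2 + 0.018 * ahi + 0.49


def pet(tempc, ahi, pet_tw_m):
    # The slide defines PET for Tempc > 0 and Tempc < 0 only.
    # Assumption: Tempc == 0 is treated like the Tempc < 0 case.
    if tempc <= 0:
        return 0.0
    return 1.6 * (10 * tempc / ahi) ** a_exponent(ahi) * pet_tw_m


def w(eet, pet_value):
    # W = 0.5 + 0.5 * EET / PET
    # Assumption: when PET is 0 the ratio is dropped, leaving W = 0.5.
    if pet_value == 0:
        return 0.5
    return 0.5 + 0.5 * eet / pet_value


def sr_fas(ndvi):
    # Transcribed exactly as shown on the slide.
    return (ndvi + 1000) / (ndvi - 1000)


def fpar_fas(ndvi, sr_veg):
    # FPAR-FAS = min[(SR-FAS - 1.08) / SR(UMD-VEG), 0.95]
    return min((sr_fas(ndvi) - 1.08) / sr_veg, 0.95)


def ipar(ndvi, sr_veg, monthly_solar, sol_conver):
    # IPAR = 0.5 * FPAR-FAS * Monthly-Solar * Sol-Conver
    return 0.5 * fpar_fas(ndvi, sr_veg) * monthly_solar * sol_conver


def efficiency(topt, tempc, eet, ahi, pet_tw_m):
    # E = 0.56 * T1 * T2 * W
    return 0.56 * t1(topt) * t2(topt, tempc) * w(eet, pet(tempc, ahi, pet_tw_m))


def npp_c(months):
    # NPPc = sum over months of max(E * IPAR, 0).
    # `months` is a list of dicts with hypothetical keys, one dict per month.
    total = 0.0
    for m in months:
        e = efficiency(m["topt"], m["tempc"], m["eet"], m["ahi"], m["pet_tw_m"])
        i = ipar(m["ndvi"], m["sr_veg"], m["monthly_solar"], m["sol_conver"])
        total += max(e * i, 0.0)
    return total
```

Calling `npp_c` with one dictionary per month returns the summed, clamped product of E and IPAR. Because every intermediate quantity is a named function, a scientist can inspect T1, T2, or W separately, which is the sense in which such a model is communicable.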
Gene regulation model
[Figure: network of signed (+/−) regulatory links among Light, RR, NBLR, NBLA, DFR, PBS, psbA1, psbA2, cpcB, Health, and Photo]
Claim 2: Background Knowledge
Scientists often have initial knowledge that should influence the
discovery process.
Ignoring this knowledge can produce models that scientists reject
as nonsensical (Pazzani et al., 2001).
[Figure: observations and an initial gene regulation model feed a model revision step, which produces a revised model; links are marked +, −, or ×]
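The revision setting on this slide (observations plus an initial model yielding a revised model) can be illustrated with a small hill-climbing sketch. This is not the algorithm used in the cited work. It is a generic greedy search, and the scoring rule, the `cost` parameter, the `possible_links` argument, and all function names are assumptions chosen for illustration. The idea it shows is that each departure from the scientist's initial model carries a penalty, so background knowledge is changed only when the data argue for it.

```python
def correlation(xs, ys):
    # Pearson correlation; returns 0.0 when either series is constant.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def distance(model, initial):
    # Number of links on which the two models disagree (added, deleted, or flipped).
    keys = set(model) | set(initial)
    return sum(1 for k in keys if model.get(k) != initial.get(k))


def score(model, data, initial, cost):
    # Hypothetical score: reward links whose sign agrees with the observed
    # correlation, and charge `cost` for every change to the initial model.
    fit = sum(sign * correlation(data[src], data[tgt])
              for (src, tgt), sign in model.items())
    return fit - cost * distance(model, initial)


def neighbors(model, possible_links):
    # All models reachable by adding, deleting, or flipping one link.
    for link in possible_links:
        for sign in (+1, -1, None):
            if model.get(link) == sign:
                continue
            revised = dict(model)
            if sign is None:
                del revised[link]
            else:
                revised[link] = sign
            yield revised


def revise(initial, data, possible_links, cost=0.5):
    # Greedy search: apply the best single change until none improves the score.
    current = dict(initial)
    best = score(current, data, initial, cost)
    while True:
        candidate = max(neighbors(current, possible_links),
                        key=lambda m: score(m, data, initial, cost),
                        default=None)
        if candidate is None:
            return current
        candidate_score = score(candidate, data, initial, cost)
        if candidate_score <= best:
            return current
        current, best = candidate, candidate_score
```

A model here is a dictionary mapping `(source, target)` pairs to `+1` or `-1`. Raising `cost` makes the search more conservative about overriding the initial model, and restricting `possible_links` is another way to encode what the scientist considers plausible.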
Claim 3: Small Data Sets
Most data-mining work assumes that large data sets are available.
But in many scientific domains, data are rare and hard to obtain.
Discovering scientific knowledge from small data sets raises an
entirely different set of challenges (Lee et al., 1998).
We need more research on this important aspect of discovery.
Ecosystem model
  Number of variables: 8
  Number of equations: 11
  Number of parameters: 20
  Number of samples: 303
Gene regulation model
  Number of variables: 9
  Number of initial links: 11
  Number of possible links: 70
  Number of samples: 20
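To put the counts on this slide in perspective, the short calculation below relates sample size to model size. It assumes the first four numbers belong to the ecosystem model and the last four to the gene regulation model, following the slide's layout. The count of candidate structures further assumes each possible link is simply present or absent, which ignores link signs and is only a rough lower bound on the search space.

```python
# Counts transcribed from the slide (pairing with each model is assumed from layout).
ecosystem = {"variables": 8, "equations": 11, "parameters": 20, "samples": 303}
gene_regulation = {"variables": 9, "initial_links": 11,
                   "possible_links": 70, "samples": 20}

# Observations available per free parameter in the ecosystem model.
samples_per_parameter = ecosystem["samples"] / ecosystem["parameters"]

# Observations available per candidate link in the gene regulation model.
samples_per_possible_link = (gene_regulation["samples"]
                             / gene_regulation["possible_links"])

# Assumption: a structure is any subset of the possible links.
candidate_structures = 2 ** gene_regulation["possible_links"]

print(f"ecosystem: {samples_per_parameter:.2f} samples per parameter")
print(f"gene regulation: {samples_per_possible_link:.2f} samples per possible link")
print(f"gene regulation: {candidate_structures:.2e} candidate link subsets")
```

With fewer than one sample per possible link, the gene regulation data cannot distinguish among most candidate structures on their own, which is one reason an initial model from the scientist matters in this setting.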
Claim 4: Scientific Explanation
Most work on data mining finds models that, although accurate,
merely describe the observations.
However, scientists often want models that explain their data using
familiar concepts.
Explanatory models can include theoretical entities and processes
that link back to domain knowledge (Langley et al., 2002).
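[Figure: the ecosystem model drawn as a graph over NPPc, E, e_max, IPAR, W, T1, T2, SOLAR, FPAR, SR, A, PET, EET, Topt, AHI, PETTWM, Tempc, NDVI, and VEG, beside the gene regulation model drawn as a network of signed (+/−) links among Light, RR, NBLR, NBLA, DFR, PBS, psbA1, psbA2, cpcB, Health, and Photo]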
Claim 5: Interactive Discovery
Most data-mining work focuses on entirely automated algorithms.
But most scientists want computational aids rather than systems that
would replace them.
We need more work on interactive discovery (Bridewell et al., 2007).
[Figure: a domain user, observations, and an initial gene regulation model feed a model revision step, which produces a revised model; links are marked +, −, or ×]
The PROMETHEUS System
(Bridewell et al., 2007)